{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyPBrkWjgmdf5ztz+GSsmmhe",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/LikeWind99/colab/blob/main/spark_nlp.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
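        "# Classifying the mini_newsgroups dataset with Spark\n",
        "\n",
        "This course project runs on Colab, Google's cloud platform. The main task is to classify the classic news dataset `mini_newsgroups` using pyspark and spark-nlp.\n",
        "\n",
        "Colaboratory, or \"Colab\" for short, is a product from the Google Research team. In Colab, anyone can write and execute arbitrary Python code through the browser, which makes it especially well suited to machine learning, data analysis, and education. Technically, Colab is a hosted Jupyter notebook service: it requires no setup and provides free access to computing resources, including GPUs.\n",
        "\n",
        "Google has optimized and adapted Colab in many ways, so setting up a development environment there is far more convenient than configuring one locally from scratch. Because its servers are hosted overseas, downloading data from foreign sites is also fast. For these reasons, the whole project is carried out on Colab.\n"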
      ],
      "metadata": {
        "id": "lqd5zuUUaR4J"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Install Java, pyspark, spark-nlp, and their dependencies"
      ],
      "metadata": {
        "id": "EvHwuw0pb_b0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "16q4Gl0_f-kS",
        "outputId": "9835a3ce-4161-4a49-805b-1e7153dffcd6"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "openjdk version \"1.8.0_352\"\n",
            "OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~18.04-b08)\n",
            "OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)\n",
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting pyspark\n",
            "  Downloading pyspark-3.3.1.tar.gz (281.4 MB)\n",
            "Collecting py4j==0.10.9.5\n",
            "  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)\n",
            "Building wheels for collected packages: pyspark\n",
            "  Building wheel for pyspark (setup.py) ... done\n",
            "  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=e729e30d3e2b087a316dee772c36038751a95ad88d84c6767e5d005efb5b95ec\n",
            "  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598\n",
            "Successfully built pyspark\n",
            "Installing collected packages: py4j, pyspark\n",
            "Successfully installed py4j-0.10.9.5 pyspark-3.3.1\n",
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting spark-nlp\n",
            "  Downloading spark_nlp-4.2.6-py2.py3-none-any.whl (453 kB)\n",
            "Installing collected packages: spark-nlp\n",
            "Successfully installed spark-nlp-4.2.6\n"
          ]
        }
      ],
      "source": [
        "## install java, pyspark and spark-nlp \n",
        "import os\n",
        "\n",
        "# Install java\n",
        "! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null\n",
        "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
        "os.environ[\"PATH\"] = os.environ[\"JAVA_HOME\"] + \"/bin:\" + os.environ[\"PATH\"]\n",
        "! java -version\n",
        "\n",
        "# Install pyspark\n",
        "! pip3 install --ignore-installed pyspark\n",
        "\n",
        "# Install Spark NLP\n",
        "! pip3 install --ignore-installed spark-nlp"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
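        "### Download the dataset\n",
        "Here we download the ready-made `mini_newsgroups` dataset with the Linux `wget` command. The original plan was to crawl current-affairs news from Tencent News with the distributed crawler `scrapy-redis`, but Colab does not support running multiple virtual machines, so we dropped that idea and use the pre-packaged `mini_newsgroups` dataset instead."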
      ],
      "metadata": {
        "id": "qQYozDCwcO5W"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "! mkdir -p data\n",
        "! wget https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/mini_newsgroups.tar.gz"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "SnnlTV6TgIcx",
        "outputId": "df106dd9-fa79-471b-b3ee-5efe4108417b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-12-27 08:03:22--  https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/mini_newsgroups.tar.gz\n",
            "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n",
            "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 1860687 (1.8M) [application/x-httpd-php]\n",
            "Saving to: ‘mini_newsgroups.tar.gz’\n",
            "\n",
            "mini_newsgroups.tar 100%[===================>]   1.77M  --.-KB/s    in 0.07s   \n",
            "\n",
            "2022-12-27 08:03:22 (25.7 MB/s) - ‘mini_newsgroups.tar.gz’ saved [1860687/1860687]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Extract the archive into the data directory\n",
        "! tar xzf mini_newsgroups.tar.gz -C ./data/"
      ],
      "metadata": {
        "id": "JssHdLEEgKLt"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "## import relevant packages\n",
        "import os\n",
        "import re\n",
        "\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "from pyspark.sql.types import *\n",
        "from pyspark.sql.functions import expr\n",
        "from pyspark.sql import Row\n",
        "from pyspark.ml import Pipeline\n",
        "\n",
        "import sparknlp\n",
        "from sparknlp import DocumentAssembler, Finisher\n",
        "from sparknlp.annotator import *\n",
        "\n",
        "%matplotlib inline\n",
        "\n",
        "# Start a Spark session via sparknlp.start()\n",
        "spark = sparknlp.start()"
      ],
      "metadata": {
        "id": "6g8XJL8OgM3z"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
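        "We will build a classifier that identifies which newsgroup a document comes from. The newsgroup name, however, appears in each document's header, so we need to strip the headers and keep only the body text."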
      ],
      "metadata": {
        "id": "q11XhU_OnPdJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
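        "# Regex matching header lines of the form 'Name: value' (e.g. 'Subject: ...')\n",
        "HEADER_PTN = re.compile(r'^[a-zA-Z-]+:.*')\n",
        "\n",
        "def remove_header(path_text_pair):\n",
        "    \"\"\"Drop the leading header lines and keep only the message body.\"\"\"\n",
        "    path, text = path_text_pair\n",
        "    lines = text.split('\\n')\n",
        "    line_iterator = iter(lines)\n",
        "    # Consume lines until the first non-header line (usually the blank separator),\n",
        "    # which is dropped too. The '' default avoids StopIteration on header-only files.\n",
        "    while HEADER_PTN.match(next(line_iterator, '')) is not None:\n",
        "        pass\n",
        "    return path, '\\n'.join(line_iterator)"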
      ],
      "metadata": {
        "id": "f09flKElgPKx"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#####################################################################################################################\n",
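        "# SparkContext is the main entry point to Spark functionality. It represents the connection to a Spark cluster\n",
        "# and can be used to create RDDs, accumulators, and broadcast variables on that cluster.\n",
        "# Only one SparkContext may be active per JVM; call stop() on the active one before creating a new one.\n",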
        "#####################################################################################################################\n",
        "path = os.path.join('data', 'mini_newsgroups', '*')\n",
        "texts = spark.sparkContext.wholeTextFiles(path).map(remove_header)\n",
        "\n",
        "# Schema of the input data\n",
        "schema = StructType([\n",
        "    StructField('path', StringType()),\n",
        "    StructField('text', StringType()),\n",
        "])\n",
        "\n",
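        "# Create the DataFrame; the newsgroup label is the path component at index 4,\n",
        "# e.g. file:/content/data/mini_newsgroups/alt.atheism/53633 -> alt.atheism\n",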
        "texts = spark.createDataFrame(texts, schema=schema) \\\n",
        "    .withColumn('newsgroup', expr('split(path, \"/\")[4]')) \\\n",
        "    .persist()"
      ],
      "metadata": {
        "id": "_VMjAAXwgRmM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Inspect how the documents are distributed across newsgroups\n",
        "texts.groupBy('newsgroup').count().collect()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uDe6bqmVgUMX",
        "outputId": "c7b9a228-32ea-4ddb-e009-848ddadeda46"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[Row(newsgroup='comp.windows.x', count=100),\n",
              " Row(newsgroup='misc.forsale', count=100),\n",
              " Row(newsgroup='rec.sport.hockey', count=100),\n",
              " Row(newsgroup='rec.sport.baseball', count=100),\n",
              " Row(newsgroup='comp.os.ms-windows.misc', count=100),\n",
              " Row(newsgroup='comp.sys.ibm.pc.hardware', count=100),\n",
              " Row(newsgroup='comp.graphics', count=100),\n",
              " Row(newsgroup='comp.sys.mac.hardware', count=100),\n",
              " Row(newsgroup='rec.motorcycles', count=100),\n",
              " Row(newsgroup='rec.autos', count=100),\n",
              " Row(newsgroup='alt.atheism', count=100),\n",
              " Row(newsgroup='sci.crypt', count=100),\n",
              " Row(newsgroup='talk.politics.guns', count=100),\n",
              " Row(newsgroup='talk.politics.misc', count=100),\n",
              " Row(newsgroup='soc.religion.christian', count=100),\n",
              " Row(newsgroup='talk.religion.misc', count=100),\n",
              " Row(newsgroup='talk.politics.mideast', count=100),\n",
              " Row(newsgroup='sci.electronics', count=100),\n",
              " Row(newsgroup='sci.space', count=100),\n",
              " Row(newsgroup='sci.med', count=100)]"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As the output shows, the data contains 20 classes with 100 posts each."
      ],
      "metadata": {
        "id": "UaQdnP5JoRGW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Fetch the first row once and print its fields\n",
        "first_row = texts.first()\n",
        "print(first_row['path'])\n",
        "print(first_row['newsgroup'])\n",
        "print(first_row['text'])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "n-NK76G5gWCe",
        "outputId": "eb190198-0946-4f2e-c4b0-195f54328bd3"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "file:/content/data/mini_newsgroups/alt.atheism/53633\n",
            "alt.atheism\n",
            "In article <1993Apr16.223250.15242@ncsu.edu> aiken@news.ncsu.edu (Wayne NMI Aiken) writes:\n",
            ">JSN104@psuvm.psu.edu wrote:\n",
            ">: YOU BLASHEPHEMERS!!! YOU WILL ALL GO TO HELL FOR NOT BELIEVING IN GOD!!!!  BE\n",
            ">: PREPARED FOR YOUR ETERNAL DAMNATION!!!\n",
            ">\n",
            ">Did someone leave their terminal unattended again?\n",
            ">\n",
            ">--\n",
            ">\n",
            ">Holy Temple of Mass  $   >>> slack@ncsu.edu <<<    $  \"My used underwear\n",
            ">   Consumption!      $                             $   is legal tender in\n",
            ">PO Box 30904         $     BBS: (919) 782-3095     $   28 countries!\"\n",
            ">Raleigh, NC  27622   $  Warning: I hoard pennies.  $     --\"Bob\"\n",
            "\n",
            "Probably not! The jesus freak's post is probably JSN104@PSUVM. Penn State\n",
            "is just loaded to the hilt with bible bangers. I use to go there *vomit* and\n",
            "it was the reason I left. They even had a group try to stop playing \n",
            "rock music in the dining halls one year cuz they deemed it satanic. Kampus\n",
            "Krusade for Khrist people run the damn place for the most part....except\n",
            "the Liberal Arts departments...they are the safe havens.\n",
            "-wdb\n",
            "\n",
            "v\n",
            "rock music in the dining\n",
            "t\n",
            "\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As shown above, the preprocessing has stripped the header block from the raw text, so each post now contains only its message body."
      ],
      "metadata": {
        "id": "JRx71-qcoyI7"
      }
    },
    {
      "cell_type": "code",
      "source": [
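        "# Build a preprocessing pipeline: sentence detection, tokenization, lemmatization, and normalization.\n",
        "# Its output is later turned into features for the classifier.\n",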
        "assembler = DocumentAssembler()\\\n",
        "    .setInputCol('text')\\\n",
        "    .setOutputCol('document')\n",
        "sentence = SentenceDetector() \\\n",
        "    .setInputCols([\"document\"]) \\\n",
        "    .setOutputCol(\"sentences\")\n",
        "tokenizer = Tokenizer()\\\n",
        "    .setInputCols(['sentences'])\\\n",
        "    .setOutputCol('token')\n",
        "lemmatizer = LemmatizerModel.pretrained()\\\n",
        "    .setInputCols(['token'])\\\n",
        "    .setOutputCol('lemma')\n",
        "normalizer = Normalizer()\\\n",
        "    .setCleanupPatterns([\n",
        "        '[^a-zA-Z.-]+', \n",
        "        '^[^a-zA-Z]+', \n",
        "        '[^a-zA-Z]+$',\n",
        "    ])\\\n",
        "    .setInputCols(['lemma'])\\\n",
        "    .setOutputCol('normalized')\\\n",
        "    .setLowercase(True)\n",
        "finisher = Finisher()\\\n",
        "    .setInputCols(['normalized'])\\\n",
        "    .setOutputCols(['normalized'])\\\n",
        "    .setOutputAsArray(True)\n",
        "pipeline = Pipeline().setStages([\n",
        "    assembler, sentence, tokenizer, \n",
        "    lemmatizer, normalizer, finisher\n",
        "]).fit(texts)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GRXnH4jSgZOC",
        "outputId": "0d49caac-099a-4a87-813c-5ec3ac970415"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "lemma_antbnc download started this may take some time.\n",
            "Approximate size to download 907.6 KB\n",
            "[OK!]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "processed = pipeline.transform(texts).persist()"
      ],
      "metadata": {
        "id": "WBD1Dgffgchb"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Print the number of processed documents\n",
        "print(processed.count())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "W_UTXe9LgeQi",
        "outputId": "daba67b4-97be-434c-d1ac-92c5713e0df2"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "2000\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
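        "Document vectors built with TF-IDF are the most common feature type for document classification and regression. Such features come with a difficulty, though: depending on the size of the corpus, we may end up with tens of thousands of features, while any single post should have only a few hundred to a few thousand non-zero ones. Materializing the full dense term matrix is therefore impractical. Instead, we use a sparse representation of the feature matrix, which omits entries whose count is 0."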
      ],
      "metadata": {
        "id": "RVuYfFkAqL0q"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Build term-count vectors and compute TF-IDF\n",
        "from pyspark.ml.feature import CountVectorizer, IDF\n",
        "\n",
        "count_vectorizer = CountVectorizer(\n",
        "    inputCol='normalized', outputCol='tf', minDF=10)\n",
        "idf = IDF(inputCol='tf', outputCol='tfidf', minDocFreq=10)\n",
        "\n",
        "bow_pipeline = Pipeline(stages=[count_vectorizer, idf])\n",
        "bow_pipeline = bow_pipeline.fit(processed)\n",
        "\n",
        "bows = bow_pipeline.transform(processed)"
      ],
      "metadata": {
        "id": "A4KNw7MCggHo"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "bows.limit(5).toPandas()[['tf', 'tfidf']]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "pMdDIynOgix0",
        "outputId": "efa337c2-7f6c-406b-dce2-1133d026c710"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                                  tf  \\\n",
              "0  (9.0, 6.0, 4.0, 1.0, 1.0, 1.0, 3.0, 5.0, 0.0, ...   \n",
              "1  (2.0, 7.0, 3.0, 3.0, 4.0, 1.0, 2.0, 1.0, 5.0, ...   \n",
              "2  (10.0, 32.0, 15.0, 12.0, 12.0, 2.0, 12.0, 5.0,...   \n",
              "3  (10.0, 11.0, 2.0, 2.0, 5.0, 3.0, 4.0, 3.0, 4.0...   \n",
              "4  (1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ...   \n",
              "\n",
              "                                               tfidf  \n",
              "0  (0.6576351108883792, 0.5491156405729176, 0.526...  \n",
              "1  (0.14614113575297316, 0.6406349140017372, 0.39...  \n",
              "2  (0.7307056787648658, 2.9286167497222273, 1.976...  \n",
              "3  (0.7307056787648658, 1.0067120077170157, 0.263...  \n",
              "4  (0.07307056787648658, 0.0915192734288196, 0.0,...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-d96950e5-a9f8-4fcf-a312-ebf3dd15a3d0\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>tf</th>\n",
              "      <th>tfidf</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>(9.0, 6.0, 4.0, 1.0, 1.0, 1.0, 3.0, 5.0, 0.0, ...</td>\n",
              "      <td>(0.6576351108883792, 0.5491156405729176, 0.526...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>(2.0, 7.0, 3.0, 3.0, 4.0, 1.0, 2.0, 1.0, 5.0, ...</td>\n",
              "      <td>(0.14614113575297316, 0.6406349140017372, 0.39...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>(10.0, 32.0, 15.0, 12.0, 12.0, 2.0, 12.0, 5.0,...</td>\n",
              "      <td>(0.7307056787648658, 2.9286167497222273, 1.976...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>(10.0, 11.0, 2.0, 2.0, 5.0, 3.0, 4.0, 3.0, 4.0...</td>\n",
              "      <td>(0.7307056787648658, 1.0067120077170157, 0.263...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>(1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ...</td>\n",
              "      <td>(0.07307056787648658, 0.0915192734288196, 0.0,...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
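        "Our task is text classification over many classes, and posts from different classes use different keywords. Here we pick a handful of sample terms and look them up in the data to see how they are distributed across newsgroups. We use the `RegexMatcher` provided by spark-nlp to match the keywords."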
      ],
      "metadata": {
        "id": "oqQwoD5yrgPx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%writefile scifi_rules.tsv\n",
        "\\w+(lith|ant|an)ium,mineral\n",
        "(alien|cosmic|quantum|dimension(al)?),space_word"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9dXQSqp0gkoP",
        "outputId": "786ff664-8a97-40b6-d6cc-dd2c0aad3615"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Writing scifi_rules.tsv\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Search the text with the regular-expression rules written above\n",
        "regex_matcher = RegexMatcher() \\\n",
        "    .setInputCols(['document']) \\\n",
        "    .setOutputCol(\"regex\") \\\n",
        "    .setExternalRules('./scifi_rules.tsv', ',')"
      ],
      "metadata": {
        "id": "RAfQ3Sn0gnJV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "regex_finisher = Finisher()\\\n",
        "    .setInputCols(['regex'])\\\n",
        "    .setOutputCols(['regex'])\\\n",
        "    .setOutputAsArray(True)\n",
        "\n",
        "regex_rule_pipeline = Pipeline().setStages([\n",
        "    assembler, regex_matcher, regex_finisher\n",
        "]).fit(texts)\n",
        "\n",
        "regex_matches = regex_rule_pipeline.transform(texts)"
      ],
      "metadata": {
        "id": "zHiq3Le1gp-5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "regex_matches.orderBy(expr('size(regex)').desc())\\\n",
        "    .limit(5).toPandas()[['newsgroup', 'regex']]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "XWXe4QhWgr6S",
        "outputId": "dad61129-eb98-42e4-d78c-3eb1d34c0462"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "            newsgroup                                              regex\n",
              "0  talk.politics.guns  [alien, alien, alien, alien, alien, alien, alien]\n",
              "1       comp.graphics   [dimensional, dimension, dimensional, dimension]\n",
              "2           sci.space                         [quantum, quantum, cosmic]\n",
              "3           sci.space                           [cosmic, cosmic, cosmic]\n",
              "4             sci.med                  [dimensional, alien, dimensional]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-0e62e77f-deaf-4a33-a0f3-00652a42eb77\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>newsgroup</th>\n",
              "      <th>regex</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>talk.politics.guns</td>\n",
              "      <td>[alien, alien, alien, alien, alien, alien, alien]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>comp.graphics</td>\n",
              "      <td>[dimensional, dimension, dimensional, dimension]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>sci.space</td>\n",
              "      <td>[quantum, quantum, cosmic]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>sci.space</td>\n",
              "      <td>[cosmic, cosmic, cosmic]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>sci.med</td>\n",
              "      <td>[dimensional, alien, dimensional]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0e62e77f-deaf-4a33-a0f3-00652a42eb77')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-0e62e77f-deaf-4a33-a0f3-00652a42eb77 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-0e62e77f-deaf-4a33-a0f3-00652a42eb77');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "可以看到我们生成的关键字在相应的文章中是可以找到的"
      ],
      "metadata": {
        "id": "5p2MwEkzr0H4"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 去除无意义的停用词\n",
        "from pyspark.ml.feature import StopWordsRemover\n",
        "\n",
        "sw_remover = StopWordsRemover() \\\n",
        "    .setInputCol(\"normalized\") \\\n",
        "    .setOutputCol(\"filtered\") \\\n",
        "    .setStopWords(StopWordsRemover.loadDefaultStopWords(\"english\"))"
      ],
      "metadata": {
        "id": "RS_GQg_ogvFf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "接下来展示如何更新一下我们的pipeline，仅作为演示，项目实际使用的pipeline将从头构建，我们使用`setStages`方法来更新pipeline，将每一步需要完成的步骤放入stage列表中即可。"
      ],
      "metadata": {
        "id": "uqcNoZuQseNs"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 更新我们的pipeline\n",
        "count_vectorizer = CountVectorizer(inputCol='filtered', \n",
        "    outputCol='tf', minDF=10)\n",
        "idf = IDF(inputCol='tf', outputCol='tfidf', minDocFreq=10)\n",
        "\n",
        "pipeline = Pipeline() \\\n",
        "    .setStages([\n",
        "        assembler, \n",
        "        sentence, \n",
        "        tokenizer, \n",
        "        lemmatizer, \n",
        "        normalizer, \n",
        "        finisher, \n",
        "        sw_remover,\n",
        "        count_vectorizer,\n",
        "        idf\n",
        "    ]) \\\n",
        "    .fit(texts)"
      ],
      "metadata": {
        "id": "gMnfmsLGgw25"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "features = pipeline.transform(texts).persist()"
      ],
      "metadata": {
        "id": "ljuodWZ9gzZv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "features.printSchema()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "omav50r0g2GZ",
        "outputId": "da98eefa-22ad-4bae-f80d-d56d9777c9ef"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "root\n",
            " |-- path: string (nullable = true)\n",
            " |-- text: string (nullable = true)\n",
            " |-- newsgroup: string (nullable = true)\n",
            " |-- normalized: array (nullable = true)\n",
            " |    |-- element: string (containsNull = true)\n",
            " |-- filtered: array (nullable = true)\n",
            " |    |-- element: string (containsNull = true)\n",
            " |-- tf: vector (nullable = true)\n",
            " |-- tfidf: vector (nullable = true)\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "pipeline.stages"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Z22jUBCAg4D5",
        "outputId": "0dc1ecab-5efe-42ad-8667-ce97d0155ace"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[DocumentAssembler_e24082572e97,\n",
              " SentenceDetector_7eebabc51803,\n",
              " REGEX_TOKENIZER_2b61af83b6d9,\n",
              " LEMMATIZER_c62ad8f355f9,\n",
              " NORMALIZER_f0ca58f39a04,\n",
              " Finisher_8f7b6f94f328,\n",
              " StopWordsRemover_6379bd8e6251,\n",
              " CountVectorizerModel: uid=CountVectorizer_65e7a8e33c48, vocabularySize=3033,\n",
              " IDFModel: uid=IDF_13dbab20cb6c, numDocs=2000, numFeatures=3033]"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "cv_model = pipeline.stages[-2]"
      ],
      "metadata": {
        "id": "JsZ7y3nSg6AH"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "接下来我们可以查看一下我们构建的词汇表的信息，包括词汇表的大小，以及出现最多的词汇是哪些"
      ],
      "metadata": {
        "id": "bKJHD4CvsunA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 打印词汇表的大小\n",
        "len(cv_model.vocabulary)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VY2sIvvag77m",
        "outputId": "6642b84a-3b8e-45d8-8e2d-48c5983fc27c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "3033"
            ]
          },
          "metadata": {},
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# 查看词汇表的一些信息\n",
        "cv_model.vocabulary[:10]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aFYrngmgg990",
        "outputId": "e8771800-03d3-4eee-c0ea-befc1338396b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['write', 'one', 'use', 'get', 'article', 'say', 'know', 'x', 'make', 'dont']"
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "tf = features.select('tf').toPandas()\n",
        "tf = tf['tf'].apply(lambda sv: sv.toArray())\n",
        "mean_tf = pd.Series(tf.mean(), index=cv_model.vocabulary)"
      ],
      "metadata": {
        "id": "XmNiCOMAg_ca"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "plt.figure(figsize=(12, 8))\n",
        "mean_tf.hist(bins=10)\n",
        "plt.title('Histogram of mean term frequency per word over the corpus')\n",
        "plt.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 499
        },
        "id": "23OB3qHMhBYL",
        "outputId": "cb9d9560-350f-4fa7-e89b-906c1dc27a7b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 864x576 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAswAAAHiCAYAAAD8n5rBAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3dfbhkV10n+u+PNERIeAkEe/IGjdxmxkTmAvZAHJzxIAIh6gSuIyTDSxoj8QVUNKNEriMIMhOvgqN3HCRIhjchRBTpIXFyY+SQQQgmaAQSYNKSxs4LiZAEaKJohnX/2PuEyrHOOpXu03VOuj+f56mna6+9aq+1d62u861da1dVay0AAMB091nvDgAAwEYmMAMAQIfADAAAHQIzAAB0CMwAANAhMAMAQIfADB1VdXVVLax3P9ZTVT27qnZX1Z6qevx69+fepKp+uaq+UFWfX+++sHeqaqGqrl/vfsyiqt5SVb+83v2AA5HAzEGrqnZV1fcsK9teVR9aWm6tndBaW1xlO1uqqlXVpv3U1fX2a0le2lo7vLX2F+vdmVlNe37n3P4jkpyV5PjW2j9Zr35wYFr+WgXsXwIzbHAbIIg/MsnV69yHuarBvr4+PiLJF1trt6zQxno/r/ca8zpWG/U52aj9Ws29td8wjcAMHZNnKavqiVV1ZVV9uapurqrXj9UuG/+9fZy28B1VdZ+q+oWq+lxV3VJVb6uqB09s94Xjui9W1X9Y1s6rquo9VfWOqvpyku1j2x+pqtur6qaq+i9Vdb+J7bWq+vGquraqvlJVr6mqR1fVh8f+XjBZf9k+Tu1rVR1aVXuSHJLkL6vqr1Z4/D1qu6q+r6quGvflw1X1zyfWnV1VfzVu55qqevbEuu1V9aGq+rWquq2qrquqZ67Qp7dnCKz/fXxOfm4sP3Fs8/aq+svJ6TZVtVhVr62qP01yR5Jv2dvjOj6XlyQ5emz/LROfRJxRVX+d5E/Guj9UVZ8a9+niqnrkxHaeVlWfrqovjc/5B6vqhyfGyTsm6t7tk47xOXzzOF5uqGF6yCGzHMuqemhV/bequnFc/4dj+Ser6vsn6t23hikn/2iqTo1TGarqFWOdXVX1vIn1h47t/3UN/59+u6ruv+yxL69hOst/m7L9z1XVt4/3nzfu+wnj8hkTfT60qv7zuC83jvcPXamdqrr/+HzdVlXXJPkXU4bYZD/+ZVVdMT5HV1TVvxzLn1tVVy6r+9NVtWNf97+qvjXJbyf5jnF83T6x+oiqunAcrx+tqkdPPO6fVdUlVXVrVX2mqp7T2a+pY2Bc9+Kq2jluZ0dVHT2xrlXVS6rq2iTXTpT9ZFV9dhwLv1rjG9IZxvH28XFfGcfpXWMI5qq15uZ2UN6S7EryPcvKtif50LQ6ST6S5AXj/cOTnDje35KkJdk08bgfSrIzybeMdf8gydvHdccn2ZPkO5PcL8OUh3+YaOdV4/KzMrypvX+Sb09yYpJNY3ufSvKyifZakvcleVCSE5J8LcmlY/sPTnJNktNXOA4r9nVi2/9H5zjO3HaSxye5JcmTMgTx08djfOi4/geTHD3u93OTfDXJURPPzT8kefH42B9LcmOSmuX5TXJMki8mOXnc/tPG5YeP6xeT/PW4D5uS3Hcfj+tCkusnlpfGyduSHDY+r6eMx/5bxzZ/IcmHx/pHJvlKkn879uWnk9yZ5Icnxsk7pmx/07j83iRvHNv65iR/luRHZjmWSS5M8u4kR4xtf9dY/nNJ3j3R5ilJPtHZ/zuTvD7JoUm+a3w+/+m4/teT7Ejy0CQPTPLfk/ynZY/9lfGx95+y/bclOWu8f26Sv0ryYxPrfnq8/+okl4/H4OFJPpzkNSu1k+ScJP9z7NdxST45+Twu68NDk9yW5AXj83fauPywJA8Yn7+tE/WvSHLqGu3/9ky8Vo1lb8kwpp849ud3k5w/rjssye4kLxrXPT
7JFzJMGZq2byuNge8eH/eEsW//b5LLlr0eXDLu1/0nyj4wlj0iyf/KDON47POX840xc1SSE/bn3wU3t5Vu694BN7f1umUIVHuS3D5xuyMrB+bLkvxSkiOXbeeuF/iJskuT/PjE8j/NEFA2JfnFJO+aWPeAJH+fuwfmy1bp+8uSvHdiuSV58sTyx5K8fGL5dUn+8wrbWrGvE9teLTDP1HaSN2QMKxPrP7P0x3jKtq9Kcsp4f3uSncuOW0vyTzrP72Rgfnkm3giMZRfnG2F+Mcmr93bfprS/kOmB+Vsmyv4oyRkTy/cZx+Ajk7wwyeUT6yrJ9ZktaGzOEO7vP7H+tCQfWO1YZgglX09yxJR9OjpDCHzQuPyeJD/X2f87kxw2UXZBkv8w7stXkzx6Yt13JLlu4rF/n+SbOuPujCQ7xvufSvLD+UY4/FySJ4z3/yrJyROPe0aSXSu1k+SzSU6aWD4zKwfmFyT5s2VlH0myfbz/jiS/ON7fOh67B6zR/m/P9MD8OxPLJyf59Hj/uUn+57L6b0zyyinb7o2BNyf5fyaWD8/werFl4v/Md0/5fzR5TH88yaUzjOPDMrwu/0CmvGlwc5vnzZQMDnbPaq09ZOmW4YV8JWckeUyST48fvX5fp+7RGf5oL/lcvhFkjs5wpidJ0lq7I8NZoUm7Jxeq6jFV9f6q+nwN0zT+Y4YzkJNunrj/t1OWD9+Lvs5q1rYfmeSsGqZE3D5+lHzc2IelqSpXTaz7ttx9P+/6tonxuKWzX8s9MskPLmv7OzOEgyW7pzxub4/rSibbeGSS35joz60ZwtQx+cfjpK3Qv2kemeGs4E0T235jhrOsS1Y6lsclubW1dtvyjbbWbkzyp0l+oKoekuSZGc5iruS21tpXJ5Y/N+7XwzMEx49N9O9/jOVL/qa19nedbX8wyb+qqqMynCW/IMmTq2pLhrP/V431po3voyeWl7dzt+O+7LHLLd/2Uv1jxvvvzPBGJUn+XZI/HI/1Wuz/Sia/keWO3P3/3pOWjf/nZXiTtNyKYyDL9rm1tifD69cxE3WmjdPlx/ToKXXuZhw7z03yoxnG8oVV9c9WexzsDwIzzKi1dm1r7bQMoeNXkrynqg7LcDZkuRsz/IFa8ogMZ9tuTnJTkmOXVozzFh+2vLlly29I8ukMH+8+KMkrMgSrtdDr61rbneS1k29SWmsPaK29q4a5u29K8tIkDxvfwHwye7+fy4/h7gxnmCfbPqy1dk7nMfvDZBu7M0yTmOzT/VtrH84wTo5bqlhVNbmc4QzlAyaWJ4PP7gxnmI+c2O6DWmsnzNC/3UkeOgbiad6a5PkZps98pLV2Q2dbR4z/R5Y8IsN4+0KGNxsnTPTvwa21yTcf3eeitbYzQyD8iQyfyHw5Q1g8M8OZ16+PVaeN7xs77dztuI/1V7J820v1l47JJUkeXlWPyxCc3zmW7/P+z7B+ud1JPrhsrB3eWvuxFequNAbuts/j8/uwfGOfV+rb8mO69Bz0xnFaaxe31p6W4Y3tpzO8RsDcCcwwo6p6flU9fPxDvHSRzdeT/M3477dMVH9Xkp+uqkdV1eEZzgi/u7V2Z4aPsb9/vFjofhk+klwtFD4ww1y+PeMZlml/5PZWr69r7U1JfrSqnlSDw6rqe6vqgRk+fm0Zjmeq6kUZzjDvrZtz9+fkHRmO+zOq6pCq+qbx4qpjV3j8PPx2kp+vb1ys9uCq+sFx3YVJTqiq/2u8AOonc/cwcVWSf11Vj6jhgtKfX1rRWrspyf+X5HVV9aAaLux8dFV912odGh/7R0n+a1UdUcOFff96osofZpi/+lMZ5gqv5peq6n5V9a+SfF+S3xv/D70pya9X1TeP+35MVT1jhu1N+mCGN1gfHJcXly0nw/j+hap6eFUdmWFK1DuysgsyPCdHjGPjJzp1L0rymKr6d1W1qaqem+EahfcnSWvtH5L8XpJfzTB/95KxfC32/+Ykx9YKF/NO8f6xry8Yn9P7VtW/qO
ECwrtZZQy8K8mLqupxNVw8+R+TfLS1tmuV9n923NZxGcbOu8fyFcdxVW2uqlPGUP61DFPovr58wzAPAjPM7qQkV9fwzRG/keHinb8dP2J9bZI/HT/qPDHJeUnenmHe83VJ/i7jH97W2tXj/fMznM3ak+FCuK912v73GT7S/UqGP7Tv7tS9p1bs61prrV2Z4UKz/5Lh4qidGeZiprV2TYY5wR/JEAYem+Hj/731nzIEpdur6t+31nZnuEjtFRlC+e4kP5t1fB1srb03w6cV549TbT6ZYZpDWmtfyHAW95wMH3lvzcTxaK1dkmEcfDzD3Or3L9v8CzNcVHpNhmP9ntx9+knPCzLMS/10hrH5sol2/zbJ7yd5VIYLRHs+P7Z9Y4apGz/aWvv0uO7lGZ7/y8d9/+MM8+fviQ9meDN52QrLSfLLSa7McJw+keTPx7KV/FKGKQPXZXjT8faVKrbWvpjhTcBZGZ6jn0vyfeNzt+SdSb4nwxuFyTeh+7r/f5Lh6x4/X1VfWK1ya+0rSZ6e5NQMz8fn842LCqeZOgZaa3+cYR7672d4/Xr0uM3VvC/DOL0qw5vBN4/b643j+yT5mbG/t2a4cHQtTxbAzJauiAbWyXhW9/YM0y2uW+/+sHFV1WKGC6R+Z5378YtJHtNae36nzkKGvq7nGXw2gKpqGV7fdq53X2BvOcMM66Cqvr+qHjB+1PhrGc587VrfXsHqquqhGS6APXe9+wIwLwIzrI9TMnzMeGOGj9pPbT7uYYOrqhdnmMryR621y1arD3CgMCUDAAA6nGEGAIAOgRkAADo2rXcHeo488si2ZcuWubX31a9+NYcddtjqFTloGBNMY1wwjXHBNMbFvcfHPvaxL7TWHj5t3YYOzFu2bMmVV145t/YWFxezsLAwt/bY+IwJpjEumMa4YBrj4t6jqpb/1P1dTMkAAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6Ni03h3YqLacfeF6d2Hudp3zvevdBQCADccZZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgY9XAXFXHVdUHquqaqrq6qn5qLH9VVd1QVVeNt5MnHvPzVbWzqj5TVc+YKD9pLNtZVWfvn10CAIC1s2mGOncmOau19udV9cAkH6uqS8Z1v95a+7XJylV1fJJTk5yQ5Ogkf1xVjxlX/1aSpyW5PskVVbWjtXbNWuwIAADsD6sG5tbaTUluGu9/pao+leSYzkNOSXJ+a+1rSa6rqp1Jnjiu29la+2ySVNX5Y12BGQCADWuWM8x3qaotSR6f5KNJnpzkpVX1wiRXZjgLfVuGMH35xMOuzzcC9u5l5U+a0saZSc5Mks2bN2dxcfGedHGf7Nmz5672znrsnXNrd6OY57G+t5gcE7DEuGAa44JpjIsDw8yBuaoOT/L7SV7WWvtyVb0hyWuStPHf1yX5oX3tUGvt3CTnJsm2bdvawsLCvm5yZouLi1lqb/vZF86t3Y1i1/MW1rsLG87kmIAlxgXTGBdMY1wcGGYKzFV13wxh+Xdba3+QJK21myfWvynJ+8fFG5IcN/HwY8eydMoBAGBDmuVbMirJm5N8qrX2+onyoyaqPTvJJ8f7O5KcWlWHVtWjkmxN8mdJrkiytaoeVVX3y3Bh4I612Q
0AANg/ZjnD/OQkL0jyiaq6aix7RZLTqupxGaZk7EryI0nSWru6qi7IcDHfnUle0lr730lSVS9NcnGSQ5Kc11q7eg33BQAA1tws35LxoSQ1ZdVFnce8Nslrp5Rf1HscAABsNH7pDwAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgI5VA3NVHVdVH6iqa6rq6qr6qbH8oVV1SVVdO/57xFheVfWbVbWzqj5eVU+Y2NbpY/1rq+r0/bdbAACwNmY5w3xnkrNaa8cnOTHJS6rq+CRnJ7m0tbY1yaXjcpI8M8nW8XZmkjckQ8BO8sokT0ryxCSvXArZAACwUa0amFtrN7XW/ny8/5Ukn0pyTJJTkrx1rPbWJM8a75+S5G1tcHmSh1TVUUmekeSS1tqtrbXbklyS5KQ13RsAAFhj92gOc1VtSfL4JB9Nsrm1dtO46vNJNo/3j0mye+Jh149lK5UDAMCGtWnWilV1eJLfT/Ky1tqXq+quda21VlVtLTpUVWdmmMqRzZs3Z3FxcS02O5M9e/bc1d5Zj71zbu1uFPM81vcWk2MClhgXTGNcMI1xcWCYKTBX1X0zhOXfba39wVh8c1Ud1Vq7aZxycctYfkOS4yYefuxYdkOShWXli8vbaq2dm+TcJNm2bVtbWFhYXmW/WVxczFJ728++cG7tbhS7nrew3l3YcCbHBCwxLpjGuGAa4+LAMMu3ZFSSNyf5VGvt9ROrdiRZ+qaL05O8b6L8heO3ZZyY5Evj1I2Lkzy9qo4YL/Z7+lgGAAAb1ixnmJ+c5AVJPlFVV41lr0hyTpILquqMJJ9L8pxx3UVJTk6yM8kdSV6UJK21W6vqNUmuGOu9urV265rsBQAA7CerBubW2oeS1AqrnzqlfkvykhW2dV6S8+5JBwEAYD35pT8AAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOgQmAEAoENgBgCADoEZAAA6BGYAAOhYNTBX1XlVdUtVfXKi7FVVdUNVXTXeTp5Y9/NVtbOqPlNVz5goP2ks21lVZ6/9rgAAwNqb5QzzW5KcNKX811trjxtvFyVJVR2f5NQkJ4yP+a9VdUhVHZLkt5I8M8nxSU4b6wIAwIa2abUKrbXLqmrLjNs7Jcn5rbWvJbmuqnYmeeK4bmdr7bNJUlXnj3Wvucc9BgCAOVo1MHe8tKpemOTKJGe11m5LckySyyfqXD
+WJcnuZeVPmrbRqjozyZlJsnnz5iwuLu5DF++ZPXv23NXeWY+9c27tbhTzPNb3FpNjApYYF0xjXDCNcXFg2NvA/IYkr0nSxn9fl+SH1qJDrbVzk5ybJNu2bWsLCwtrsdmZLC4uZqm97WdfOLd2N4pdz1tY7y5sOJNjApYYF0xjXDCNcXFg2KvA3Fq7eel+Vb0pyfvHxRuSHDdR9dixLJ1yAADYsPbqa+Wq6qiJxWcnWfoGjR1JTq2qQ6vqUUm2JvmzJFck2VpVj6qq+2W4MHDH3ncbAADmY9UzzFX1riQLSY6squuTvDLJQlU9LsOUjF1JfiRJWmtXV9UFGS7muzPJS1pr/3vczkuTXJzkkCTntdauXvO9AQCANTbLt2ScNqX4zZ36r03y2inlFyW56B71DgAA1plf+gMAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBDYAYAgA6BGQAAOgRmAADoEJgBAKBj1cBcVedV1S1V9cmJsodW1SVVde347xFjeVXVb1bVzqr6eFU9YeIxp4/1r62q0/fP7gAAwNqa5QzzW5KctKzs7CSXtta2Jrl0XE6SZybZOt7OTPKGZAjYSV6Z5ElJnpjklUshGwAANrJVA3Nr7bIkty4rPiXJW8f7b03yrInyt7XB5UkeUlVHJXlGkktaa7e21m5Lckn+cQgHAIANZ2/nMG9urd003v98ks3j/WOS7J6od/1YtlI5AABsaJv2dQOttVZVbS06kyRVdWaG6RzZvHlzFhcX12rTq9qzZ89d7Z312Dvn1u5GMc9jfW8xOSZgiXHBNMYF0xgXB4a9Dcw3V9VRrbWbxikXt4zlNyQ5bqLesWPZDUkWlpUvTttwa+3cJOcmybZt29rCwsK0avvF4uJiltrbfvaFc2t3o9j1vIX17sKGMzkmYIlxwTTGBdMYFweGvZ2SsSPJ0jddnJ7kfRPlLxy/LePEJF8ap25cnOTpVXXEeLHf08cyAADY0FY9w1xV78pwdvjIqro+w7ddnJPkgqo6I8nnkjxnrH5RkpOT7ExyR5IXJUlr7daqek2SK8Z6r26tLb+QEAAANpxVA3Nr7bQVVj11St2W5CUrbOe8JOfdo94BAMA680t/AADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQITADAECHwAwAAB0CMwAAdAjMAADQsU+Buap2VdUnquqqqrpyLHtoVV1SVdeO/x4xlldV/WZV7ayqj1
fVE9ZiBwAAYH9aizPMT2mtPa61tm1cPjvJpa21rUkuHZeT5JlJto63M5O8YQ3aBgCA/Wp/TMk4Jclbx/tvTfKsifK3tcHlSR5SVUfth/YBAGDNVGtt7x9cdV2S25K0JG9srZ1bVbe31h4yrq8kt7XWHlJV709yTmvtQ+O6S5O8vLV25bJtnpnhDHQ2b9787eeff/5e9++e2rNnTw4//PAkySdu+NLc2t0oHnvMg9e7CxvO5JiAJcYF0xgXTGNc3Hs85SlP+djEjIm72bSP2/7O1toNVfXNSS6pqk9Prmyttaq6R4m8tXZuknOTZNu2bW1hYWEfuzi7xcXFLLW3/ewL59buRrHreQvr3YUNZ3JMwBLjgmmMC6YxLg4M+zQlo7V2w/jvLUnem+SJSW5emmox/nvLWP2GJMdNPPzYsQwAADasvQ7MVXVYVT1w6X6Spyf5ZJIdSU4fq52e5H3j/R1JXjh+W8aJSb7UWrtpr3sOAABzsC9TMjYnee8wTTmbkryztfY/quqKJBdU1RlJPpfkOWP9i5KcnGRnkjuSvGgf2gYAgLnY68DcWvtskv9zSvkXkzx1SnlL8pK9bQ8AANaDX/oDAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAICOTevdATaOLWdfuN5dmLtd53zvencBANjgnGEGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADoEZgAA6BCYAQCgQ2AGAIAOgRkAADo2zbvBqjopyW8kOSTJ77TWzpl3H2DJlrMv7K4/67F3Zvsqde5tdp3zvevdBQC4V5nrGeaqOmdCEcUAAASCSURBVCTJbyV5ZpLjk5xWVcfPsw8AAHBPzPsM8xOT7GytfTZJqur8JKckuWbO/YCD1mpn1Q80zqgDsK/mHZiPSbJ7Yvn6JE+acx+Ag8j+eINwIE7VYd/Ne1x4MwjzM/c5zKupqjOTnDku7qmqz8yx+SOTfGGO7bHB/aQxwRTGBdPMe1zUr8yrJfaR14t7j0eutGLegfmGJMdNLB87lt2ltXZuknPn2aklVXVla23berTNxmRMMI1xwTTGBdMYFweGeX+t3BVJtlbVo6rqfklOTbJjzn0AAICZzfUMc2vtzqp6aZKLM3yt3Hmttavn2QcAALgn5j6HubV2UZKL5t3ujNZlKggbmjHBNMYF0xgXTGNcHACqtbbefQAAgA3LT2MDAEDHQReYq+qkqvpMVe2sqrOnrD+0qt49rv9oVW2Zfy+ZtxnGxc9U1TVV9fGqurSqVvzqGQ4cq42LiXo/UFWtqlwJfxCYZVxU1XPG14yrq+qd8+4j8zXD35BHVNUHquovxr8jJ69HP9l7B9WUjPGnuf9Xkqdl+NGUK5Kc1lq7ZqLOjyf55621H62qU5M8u7X23HXpMHMx47h4SpKPttbuqKofS7JgXBzYZhkXY70HJrkwyf2SvLS1duW8+8r8zPh6sTXJBUm+u7V2W1V9c2vtlnXpMPvdjGPi3CR/0Vp7Q1Udn+Si1tqW9egve+dgO8N8109zt9b+PsnST3NPOiXJW8f770ny1KqqOfaR+Vt1XLTWPtBau2NcvDzDd4hzYJvl9SJJXpPkV5L83Tw7x7qZZVy8OMlvtdZuSxJh+YA3y5hoSR403n9wkhvn2D/WwMEWmKf9NPcxK9Vprd
2Z5EtJHjaX3rFeZhkXk85I8kf7tUdsBKuOi6p6QpLjWmt+J/vgMcvrxWOSPKaq/rSqLq+qk+bWO9bDLGPiVUmeX1XXZ/imsJ+YT9dYKxvup7FhI6uq5yfZluS71rsvrK+quk+S1yfZvs5dYePZlGRrkoUMn0ZdVlWPba3dvq69Yj2dluQtrbXXVdV3JHl7VX1ba+3r690xZnOwnWFe9ae5J+tU1aYMH518cS69Y73MMi5SVd+T5P9O8m9aa1+bU99YP6uNiwcm+bYki1W1K8mJSXa48O+AN8vrxfVJdrTW/qG1dl2G+a1b59Q/5m+WMXFGhnntaa19JMk3JTlyLr1jTRxsgXmWn+bekeT08f6/TfIn7WC6MvLgtOq4qKrHJ3ljhrBsPuLBoTsuWmtfaq0d2VrbMl68c3mG8eGivwPbLH9H/jDD2eVU1ZEZpmh8dp6dZK5mGRN/neSpSVJV35ohMP/NXHvJPjmoAvM4J3npp7k/leSC1trVVfXqqvo3Y7U3J3lYVe1M8jNJVvwqKQ4MM46LX01yeJLfq6qrqmr5iyEHmBnHBQeZGcfFxUm+WFXXJPlAkp9trfmk8gA145g4K8mLq+ovk7wryXYn4+5dDqqvlQMAgHvqoDrDDAAA95TADAAAHQIzAAB0CMwAANAhMAMAQIfADAAAHQIzAAB0CMwAANDx/wNqjgjatC/LIgAAAABJRU5ErkJggg==\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "plt.figure(figsize=(12, 8))\n",
        "ranks = np.arange(len(mean_tf)) + 1\n",
        "plt.plot(np.log10(ranks), np.log10(mean_tf.values))\n",
        "plt.title('Plot of the log of rank (by mean term frequency) versus the log of mean term frequency')\n",
        "plt.show()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 499
        },
        "id": "r_SRa0ZvhDSg",
        "outputId": "a7ff5e0a-4066-458c-cfb6-d98840e68e35"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 864x576 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAssAAAHiCAYAAAAeQ4G4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdeXwU9f3H8ddncyeEBEi4EgjIjcihHCKCoHhrxXpb6121ra3a1uqvp1qttvWstvW+61nrbRUUueRGBAUEuQlnCBAgIff398cMuMQkBHLMJnk/H488srtzfWZ2dva93/3OrDnnEBERERGR7woFXYCIiIiISKRSWBYRERERqYLCsoiIiIhIFRSWRURERESqoLAsIiIiIlIFhWURERERkSooLDchZjbJzK5uoGX92Mw2m9luM2tTg/EvN7NpdbTs28zsxbqY10Eu92wzW+ev86AGWN5qMxt7EOO/bGbj/Nt1tr2lag29TwTNzGab2eFB11GfDvZ1d4B5OTPrXhfzOohlJpjZu2aWZ2avN+SyGzsz62VmX5jZLjP7edD1SORQWG5k/AP5Hv/NebOZPWtmLQ5yHl38g3j0IdYQA9wPnOSca+Gcy63L+Uewe4Hr/XWeH3Qx4cysPzAAeDvoWhqKv+/fGXAZEbtP1JN7gTuCLqKuRMg+VNfOBdoBbZxz5wVdTE1FyAf8XwOfOueSnXN/D7gWiSAKy43Tmc65FsCRwGDgdw28/HZAPLCogZcbtCxquM4BfFC4Fvi3068M1VgdPUdV7hNN8MMiwDvAGDNrX58LaaLbrqFkAcucc6VBF9KQ6vv17C8jqg6WIY2QwnIj5pxbD/wP6FdxmJmFzOx3ZrbGzLaY2fNmluIPnuL/3+G3UA+vZPo4M3vQzDb4fw/6j/UEloZNP7GS0qqcv5nda2bbzWyVmZ0a9niKmT1lZhvNbL2Z3VnTA5OZfc/MFpnZDr8rSp+wYUea2Xz/a7XXzezVqlqSqtpm/nrvBqKABWa2oorpnZn91My+Ab7xH3vI/5p+p5nNM7ORYePfZmav+cvZ5a/D4Crm3cffZhdVsRlOBSZ/dzJ7xP869mszO8F/8Dwzm1dhxF+YWaWt0v42vdPMpvvP57tm1sbM/u2v1xwz6xI2fm8zm2Bm28xsqZmdHzbsdP/52Olvl9vChu39RuIyM1trZlvN7LdV1HQN8APg13tr8h/vaGZvmFmOv71+HjbNbWb2HzN70cx2Apcf7LqFzavSfcK8b35uMbOFQL6ZRZvZ0f78d5jZAjMbHTafrmY22X/+J/jP14v+sNFmll1hufu6CPj7661mtsLMcv19qXVNtqWZRZnZb/xpd/n7Zicz+4eZ3Vdhme+Y2U0AzrlCYB5wchXbZIeZ9Qt7LN28b8La+vfPMO9r7h3+NulfYd0qbrtbzDse7PL3pb378H4twhW3VVXTVai30n3IN9DMFpr32nnVzOLDpqtyHapj3rHkeX/fXGPesSYU9nzc5z9Pq8zseqvm2znzjgeT/BoWmdn3/MdvB/4AXOCv01WVTHubecfCF/3t86WZ9TSz/zPvuLfOzE6qUHelx2Yz62ZmE/39b6t5r5vUsGlXm9mvqtqW4esDPAoM9+ve4T8eZ957xlrzvkl91MwS/GGjzSzbf643Ac8c7LpVqGEiMAZ4xK+hp7+f/cvMPjCzfLwPitUdYxL8abab2WIzu7nCfrlft5xK9uMDvT6q3JZmdpY/7U7zXten2EEe66Uazjn9NaI/YDUw1r/dCe9T8J/8+5OAq/3bVwLLgcOAFsB/gRf8YV0AB0RXs5w7gJlAWyAdmB62nGqnr2w4cDlQAvwIL2D8GNgAmD/8TeAxIMlf5mzg2irmfxvwon+7J5APnAjE4H2NthyI9f/WADf4w74PFAN3VjHfKreZP9wB3avZZg6YALQGEvzHLgHaANHAL4FNQHzYehQCp/
nb5G5gZsXnGu8bhLXAGVUsN8lfdnqF7V0K3OSv+wVAnl9bHLAN6BM2/nzgnCrmP8nfLt2AFGAxsMyvLRp4HngmrJZ1wBX+sEHAVqCvP3w0cATeB/X+wGZgXIX95gkgAa9bSVF4nRXqejb8ufTnOQ8vLMT6z+NK4OSw7V0CjPPHTTiYdavmOe8edn818AXeazMByABy/ec4hLef5u59roAZeF2a4oBRwC6+3bdHA9nVvP5vwHuNZvrTPwa8XJNtCdwMfAn0Aswf3gYYive6DPnjpQEFQLuwGv4O3F/F9ngauCvs/k+BD/3bg4AtwDC8/f0yf33iqth2vfD2pY5h69Stiud+37aqbroD7UNhdcwGOuK9XpYA19VkHarbP/D2pbeBZL+mZcBV/rDr8Pa9TKAV8DFVHGPxXs/Lgd/g7efH4+03vSoeH6s5fhbifeDZu4+vAn7rz/tHwKqw8as8NgPd8fbpOLz3iSnAgzXZlpXUdTkwrcJjD+B9m9Ha327vAneHPeelwF/85Scc7LpVcay7usL+kQeMwHv9JlL9MeYeYKpfbyfgK8Jew3z3ePEs/v5HzV4fVe2XQ/06T/TrzAB6c5DHev1V/Rd4Afo7yCfMe8HsBnbgBcF/8m0w2/dCBz4BfhI2XS+8oBBNzcLyCuC0sPsnA6v929VOX9lwvAPh8rD7if447fG6dRTtXQ9/+EV4fccqm/9tfBsofg+8FjYsBKzHO5CO8m9b2PBpVB2Wq9xm/v2ahOXjD/D8bQcGhK3Hx2HD+gJ7KjzXtwPZwOhq5pnhLzu+wvbe92HEf2w28EP/9r/wQw1wuF9XVW/4k4Dfht2/D/hf2P0zgS/82xcAUytM/xjwxyrm/SDwQIX9JrNCzRdWMe2z7B+YhgFrK4zzf3wb5G8DphzqulXznFcMy1eG3b+FsA9c/mMf4b0RdsZ7s08KG/YSNQ/LS4ATwoZ14Luv8Uq3Jd63Q2dVsU5LgBP929cDH1QYfhfwdBXTjgVWhN3/DLg0bJ/7U4XxlwLHVbHtuuOFh7FAzAGe+33bqrrpDrQPhdVxSdj9vwKP1mQdqto/8MJPMf6HRn/YtcAk//ZEwhoH/NqrCssj8T50h8Ieexm4LWw/P1BYnlBhH98NRPn3k/1lp3Lwx+ZxwPyabMtKpr2csLCM9yEun7APOsBw/LDrP+fF7H/cq/G6VVHDJL4blp8Pu3+gY8xK4JSwYddQ87Bck9dHVfvlY/jH0UrWqcbHev1V/ad+YY3TOOfcxwcYpyNemN5rDd6baLsaLqOy6TvWuMLKbdp7wzlXYGbgteC2xvvUv9F/DLzQu+5g63TOlZvZOrwAWQasd/5RwlfdPKvbZutrUMt35m9mvwKu8uftgJZ4rXV7bQq7XQDEm1m0+7a/4XXAZOfcpGqWucP/n4zXqrJXxXUPfw6fA142s98BP8T7wFFUzTI2h93eU8n9vSeZZgHD9n6N6osGXgAws2F4rS/98Fpm4oCKZ+xX3CY1PYE1C+hYYdlReC09e1X2/Nd03WoqfBlZwHlmdmbYYzHAp3jPxXbnXH7YsDV4LVI1kQW8aWblYY+Vsf9rvKpt2QnvA3FlnsP7RmSC//+hCsOT+Xafq+hTINF/njcDA/FaJvfWe5mZ/Sxs/Fj2P67s23bOueVmdiNeADrczD4CfuGc21DFsms1XQUVt9veGmuyDpVJw3veKx5fMvzbHdl/vznQcWqdcy78eQ+fV01U3Me3OufKwu6Dt690pJpjs5m1w9s/RuLtFyG8MBauqm15IOn4Lblhyza81/ReOc7rGnQo61bVPlxRxddzdceYis9j+PN9IDXZt6ralp2AD6qY78Ee66US6rPcdG3Ae/HttbcVazNeaDuU6Wv6ZlOT+Ydbh9d6keacS/X/WjrnanKJqv3qNO+o2gkv3G4EMizsSEv1QaS6bVZT+9bdvP7JvwbOB1o551LxviqzKqatzHVAZzN7oMoFemFrBV6XlHAV133fc+icm4
nXKjMSuBg/zNaBdXjhPjXsr4Vz7sf+8Jfwvlbt5JxLweuneDDbI1zF/WwdXqtT+LKTnXOnVTNNfaj44eyFCjUlOefuwds/W5lZUtj4ncNu5+OFBWDfyUXpFeZ9aoV5xzvvXIYDWYfX9aQyLwJnmdkAoA/wVoXhfYAFlU3oh5LX8FofLwLec87tClvmXRXqTXTOvRw+iwrze8k5dyze69LhfeUOFbYN3jdUNZnuOyVX8XhVarIOldmK1+pf8fiy97naiNcFY68DHac6md/fuZJ51aUDHZv/jLcNj3DOtcT7cFVXr+eteOH28LBlpzjv5PaqpqkPFV/P1R1jNrL/cxf+egYv4Fa13x7qvrV32kpfz/V4rG9WFJabrpeBm8w7gagF3kHtVb/FMgcox+tvVd30vzPvBJ00vD5aNb22cU3mv49zbiMwHrjPzFqad+JSNzM7rgaTvwacbmYnmHdJu1/iHdyn4/UHLQOuN+9kobPw+nZVpbptdiiS8cJ2DhBtZn/Aa1k+GLuAU4BRZnZPNeN9AFTcXm2Bn5tZjJmdhxdywlsfngceAUqcc3V1yab3gJ5m9kN/uTFmNsS+PekyGdjmnCs0s6F4B+9DtZn997HZwC7/hJ8E806a6mdmQ2qxjNp6ETjTzE7264n3T0zKdM6tAeYCt5tZrJkdi/e18V7L8L5pON3ft3+H1xK/16PAXWaWBftOpjurhnU9CfzJzHqYp7/510t3zmUDc/DeVN9wzu1tjcM/oegovFbnqryE1x3nB/7tvZ4ArjOzYf4yk/x1S65sJuZd8/Z4M4vD+8ZkD95xBby+zaeZWWvzrsxxYw2nq6jiPnQgB7UOe4V9iLjLzJL95+wXfHtMfQ24wcwyzDtB7pZqZjcLL3T92n99jcbbb145iPWokRocm5PxujnkmVkGXl/4Q7UZyDSzWH/Z5Xjb+wH79gTRDDP7zsmlDehAx5jXgP8zs1Zmlgn8rML0XwAX+9Odwv7H7EPat3xPAVf474Mhfzv1DhteH8f6ZkVhuel6Gu/NbgreCQ6F+C9c51wBXr/Dz8w76/boSqa/E++NfCHeiUCf+48dUA3nX9GleF85Lcb7Gu8/eH0wD7SspXitGQ/jtUSciXdpvWLnXDHeSX1X4X3ldglemKvqK6gqt9kh+gj4EC/0rPHnV5OuJftxzu3AO3HjVDP7UxWjPQ78oEJL8iygB952uQs41+1/TewX8LpD1NkPvPitiCcBF+K1gG3i2xNwAH4C3GFmu/A+gL1Wi8U9BfT197G3/EByBt5X/6vw1vtJvBP3AuGcWwechXcyVg7e838z3x57L8brB7kN+CPem9reafPwtteTeK2G+Xj91/d6CK+Vfry/PWf686qJ+/G2/XhgJ962TAgb/hzeiZgVW6HOxOtnW+W3TM65WX6tHfGu1rP38bl4J1g9gvcaX47XT7UqcXhddrbi7Udt8fqH4te1AK8f53jg1RpOV9F++1A1tRzqOoT7Gd52WYl37sRLeMcc8ILSeLzj7Xy8D7WleB/2K9ZQjPc8nIq3jv/E6xf+dQ3rOFjVHZtvxzsBOQ94H++k6EM1Ee+E9U1mttV/7Ba8bTzTvCvYfIx3LkkganCMuR3vWL8K7/ms+Pq5Ae+524H3YXLfPlebfcs5NxvvpOoH8J6Lyez/LUadH+ubm71XIhBpFsxsFt5JEc8EXUtdM7OX8PqjHfBN3x8/Ae9EqCOdc9/Ua3FSI+ZdSq+7c+6SgOsYhffGmhXe791//VzlnPsqsOKaAfMuq/mocy7rgCNLxPJb/V90zmUeaNx6rkPH+lrSCX7SpPlfFy7FawH4Ad7lyj4MtKh64pw72C4NPwbm6OAp4fwuHzcAT1Y4QRTnXE1bruUg+GFmDF5rZDu8bxnerHYikZrTsb6WFJalqeuF93VzEt7Xn+f6/fCaNTNbjXcizriAS5EI4vctn4vXxeGKgMtpTgzvK/xX8fpYv4/XTUmkVnSsrxvqhiEiIiIiUg
Wd4CciIiIiUgWFZRERERGRKkR0n+W0tDTXpUuXoMsQERERkSZs3rx5W51z6ZUNi+iw3KVLF+bOnRt0GSIiIiLShJlZlT9Prm4YIiIiIiJVUFgWEREREamCwrKIiIiISBUUlkVEREREqqCwLCIiIiJSBYVlEREREZEqKCyLiIiIiFShTsKymZ1iZkvNbLmZ3VrJ8Dgze9UfPsvMutTFckVERERE6lOtw7KZRQH/AE4F+gIXmVnfCqNdBWx3znUHHgD+UtvlioiIiIjUt7poWR4KLHfOrXTOFQOvAGdVGOcs4Dn/9n+AE8zM6mDZIiIiIiL1pi7CcgawLux+tv9YpeM450qBPKBNHSxbRERERKTeRNwJfmZ2jZnNNbO5OTk5QZcjIiIiIs1YXYTl9UCnsPuZ/mOVjmNm0UAKkFvZzJxzjzvnBjvnBqenp9dBeSIiIiIih6YuwvIcoIeZdTWzWOBC4J0K47wDXObfPheY6JxzdbBsEREREZF6E13bGTjnSs3seuAjIAp42jm3yMzuAOY6594BngJeMLPlwDa8QC0iIiIiEtFqHZYBnHMfAB9UeOwPYbcLgfPqYlkiIiIiIg0l4k7wC9q2/GJKysqDLkNEREREIkCdtCw3JTe8Mp/Plm8lo1UCXdok0bl1ove/TeK++wmxUUGXKSIiIiINQGG5gh8My2Jgp1RW5xawNjef9xZuJG9PyX7jtGsZR1brJLLaJNIlbf9AnZIQE1DlIiIiIlLXFJYrOKVfe07p136/x3YUFLMmt4DVufmszS3wgvS2fCYtyyFnXvZ+47ZKjKFzmyS6tEkkq3UiWW28UJ3VJom0FrHohwtFREREGg+F5RpITYwlNTGWAZ1SvzMsv6iUtdsKWJNbwJrcfNZs8/7PXb2ddxdsoDzsAnlJsVH7gvTebh1ZrRPJSkuiQ8t4QiEFaREREZFIorBcS0lx0fTp0JI+HVp+Z1hRaRnZ2/f4rdH5+wL10s27+HjJZkrKvk3SsVEhOrVO2NcSHd5POiM1gdhonYspIiIi0tAUlutRXHQU3dJb0C29xXeGlZU7Nubt8QO0F6L3BuoZK3LZU1K2b9yQQUarhH39pPd269AJhyIiIiL1S2E5IFEhI7NVIpmtEhnRff9hzjlydhftF6T3/j/QCYd7g/Te/zrhUEREROTQKSxHIDOjbXI8bZPjGdKl9XeG7z3hcM22AtZs/baf9ORlOWzZVbTfuKmJMWS1TiQlMZak2CgSY6NJivP+J8ZGkRgbRVKcdzspNprEOO//3nGSYqNJiI1SNxARERFplhSWG6HqTjgsKPZOOFy91btix+rcAtZtKyBvTwkbd+yhoLiM/OJSCorKKD6IH1+JiTI/PEeRGBf9neCdFBdFQsz+9xNjo0lrEcuoHuk6eVFEREQaJYXlJiYxNpre7VvSu/13TzisqKSsnILiMgqKS8kvqvC/uIyCotJvh/v388PG31Ncxsa8Qi+A++PmF5fi3P7LGdKlFX85pz+HVdJ3W0RERCSSKSw3YzFRIVISQnXar9k5R2FJOfnFpewpLmPGylzuen8Jpzw0lV+c2JOrj+1KdJS6dIiIiEjjoLAsdcrMSIiN2neFjk6tExndK53fv/UV9/zvaz74ciN/Pbd/jVq+RURERIKmJj6pd22T43n0kqP4x8VHsn77Hs58eBoPTFhGcWnN+0yLiIiIBEFhWRqEmXF6/w5M+MVxnH5EBx765Bu+98g0FmbvCLo0ERERkSopLEuDap0Uy4MXDuKpywazvaCYcf/4jLv/t4TCsB9hEREREYkUCssSiBP6tGP8Tcdx/uBOPDZ5Jac9NJU5q7cFXZaIiIjIfhSWJTApCTHcc05/XrxqGMVl5Zz/2Az++PZX5BeVBl2aiIiICKCwLBHg2B5pfHTjKC4b3oXnZ67h5AenMO2brUGXJSIiIqKwLJEhKS6a2753OK9fO5zYqBCXPDWLW/6zkLw9JUGXJiIiIs2YwrJElMFdWvPBDS
O57rhuvD5vHSc9MJmPF28OuiwRERFpphSWJeLEx0Rx66m9eeunI2iVGMvVz8/lhlfmsy2/OOjSREREpJlRWJaI1T8zlXeuP5Ybx/bggy83cuL9k3lv4Qacc0GXJiIiIs2EwrJEtNjoEDeO7cl7PxtJZqsErn9pPte+MI8tOwuDLk1ERESaAYvkVrrBgwe7uXPnBl2GRIjSsnKemraK+ycso9w5erdvSf/MFAZkptK/Uwo92iYTFbKgyxQREZFGxszmOecGVzpMYVkam5U5u3ltbjYLs3fwZXYeu/zrMifERNEvoyX9M1P3heisNomYKUCLiIhI1aoLy9ENXYxIbR2W3oJbT+0NQHm5Y1VuPguzd7BgXR4Ls3fw4sw1FJWWA94Pn/TPTPH/UhmQmUr7lPggyxcREZFGRC3L0uSUlJWzbPMuFmbn7QvRSzfvoqzc29fbJsfRPzOVI7NSGdUjnb4dWhJS9w0REZFmS90wpNkrLClj0YadLMzewcLsPBZk72BlTj4AaS1iGdkjnVE90xjZI520FnEBVysiIiINSd0wpNmLj4niqKxWHJXVat9jObuKmLY8h8lLc5iyLIc3568HoF9GS0b1SGdUz3SOympFTJQuGiMiItJcqWVZBK/v86INO5nyjRee563dTlm5o0VcNMO7tWFUz3RG90ynU+vEoEsVERGROqZuGCIHaWdhCdOX5+4Lz+t37AGga1oSfTokExUKEWUQMiMUMqL8/yGDqJARMiMqZESHjJMOb79fi7aIiIhEFoVlkVpwzrFyaz5TluUweVkO67YVUO6g3DnKyh3l5Y4y57zH9t4u9+4XlZZRUuY458hMbj21N+nJ6g8tIiISaRSWRQKSX1TKI58u58mpK4mPieIXJ/bkh0dnEa1+0CIiIhGjurCsd2yRepQUF80tp/TmwxtHMbBTKre/u5gzHp7G7FXbgi5NREREakBhWaQBdEtvwfNXDuXRS45k554Szn9sBje9+gVbdhYGXZqIiIhUQ2FZpIGYGaf068DHvzyOn47pxvsLN3L8fZN5cupKSsrKgy5PREREKqGwLNLAEmOjufnk3nx00yiOymrFne8v4Yy/T+Ot+evJ2VUUdHkiIiISRif4iQTIOcf4xZv503uLyd7uXZ6uV7tkhndrw4juaQw7rDUt42MCrlJERKRp09UwRCJcWbnjq/V5fLZiK9OX5zJn9TaKSsuJChlHZKQwonsbRnRL48isVsTHRAVdroiISJOisCzSyBSWlDF/7Q6mr9jKZ8u3siA7j7JyR1x0iMFdWnFMtzSO6daGIzJSdBk6ERGRWlJYFmnkdhWWMHvVNj5bnsv0FVv5etMuAJLjoxnWtY3X8tw9jR5tW2BmAVcrIiLSuFQXlqMbuhgROXjJ8TGc0KcdJ/RpB8DW3UXMWJHrtzzn8vGSzQAclpbE+UM6cc6Rmfq1QBERkTqglmWRJmDdtgKmfrOVN+dnM2f1dqJDxtg+7bhgaCdG9UgnKqTWZhERkaqoG4ZIM7J8y25em7uON+Zlk5tfTIeUeM4b3IkLhnQiIzUh6PJEREQijsKySDNUXFrOJ0s288qcdUz5JofokHHBkE5cP6YH7VPigy5PREQkYigsizRz2dsLeHTyCl6ZvY5QyLhkWBY/Ht1N/ZpFRERQWBYR37ptBfz9k2944/Ns4qKjuHxEF64ddRipibFBlyYiIhIYhWUR2c/KnN089Mk3vLNgA0mx0fTt0JK05FjSWsTRJilu3+3MVglktUmiRZwunCMiIk2XwrKIVGrppl08NW0lq3ML2Lq7iK27ithZWPqd8dJaxJLVJomsNol0apVImxaxpCbG0ioxhlaJsbRpEUv7lvG6xrOIiDRKCssiUmPFpeXk5heRs6uI7O17WJ2bz9rcAlbn5rMmt4CNeYWVTpfZKoET+7bjxL7tGNqltX5ZUEREGg2FZRGpMyVl5ewoKGFHQTHbC0rYXlDMprxCJi/LYdryrRSXlpOSEMNJfd
vxuzP6kpIQE3TJIiIi1dIv+IlInYmJCpGeHPedK2lcdkwX8otKmbIshwmLN/PG59kkxUVz2/cOD6hSERGR2lNYFpE6kxQXzalHdODUIzoQHxvFizPXcOnwLA5LbxF0aSIiIodEnQpFpF7cNLYncdEh7v7f10GXIiIicsgUlkWkXqQnx/GTMd2ZsHgzM1fmBl2OiIjIIVFYFpF6c+WIrnRIieeu95dQXh65JxOLiIhURWFZROpNQmwUN5/ciy/X5/H2gvVBlyMiInLQFJZFpF6NG5jBERkp/O3DpRSWlAVdjoiIyEFRWBaRehUKGb85rQ8b8gp5atqqoMsRERE5KLp0nIjUu+Hd2nBS33bcN34pzjl+Mro7oZB+GltERCKfWpZFpEE8cMFAzujfkXvHL+PK5+awPb846JJEREQOSGFZRBpEUlw0D104kD+N68f05bmc/vepfL52e9BliYiIVEthWUQajJnxw6Oz+M+PhxMKGec/OoM73l3MjgK1MouISGRSWBaRBtc/M5X3fzaSc4/K5Nnpqxj11095cupKikp1tQwREYksCssiEoiUxBjuOac/H9wwkgGdUrnz/SWc9MAUVm/ND7o0ERGRfRSWRSRQvdu35IWrhvHclUPZtruYP723OOiSRERE9lFYFpGIcFzPdH4ypjuffL2F6cu3Bl2OiIgIoLAsIhHkihFdyEhN4M73l1Be7oIuR0RERGFZRCJHfEwUvz6lF4s37uS/89cHXY6IiIjCsohEljP7d2RAZgr3frSUPcW6OoaIiARLYVlEIkooZPzujL5s2lnIk1NXBl2OiIg0cwrLIhJxhnRpzSmHt+dfk1ewJleXkhMRkeAoLItIRPrNaX2IjQ5x8ROzyN5eEHQ5IiLSTCksi0hE6twmkRevGsbOwhIufmIWm/IKgy5JRESaIYVlEYlY/TJSeP7KoWzLL+biJ2ayZZcCs4iINCxzLnKvZTp48GA3d+7coMsQkYDNWb2Ny56eTWx0iL4dWtK9bQt6tEvm7EEZtIiLDro8ERFp5MxsnnNucGXD9C4jIhFvSJfWvHj1MF6cuYYVOfn89/P17C4q5dOvt/DUZYMxs6BLFBGRJkphWUQahSM7t+LIzq0AcM7x1LRV3Pn+El6Zs46Lhk8m2dYAACAASURBVHYOuDoREWmqatVn2cxam9kEM/vG/9+qivHKzOwL/++d2ixTRMTMuHJEV0Z0b8Of3lusy8uJiEi9qe0JfrcCnzjnegCf+Pcrs8c5N9D/+14tlykiQihk/O3cAUSFjF+8toCy8sg9/0JERBqv2obls4Dn/NvPAeNqOT8RkRrrmJrAHWcdzrw127nnf0soKC4NuiQREWliahuW2znnNvq3NwHtqhgv3szmmtlMM6s2UJvZNf64c3NycmpZnog0deMGZnD2oAyemLqKYX/+hNveWcTXm3ZSrpZmERGpAwe8dJyZfQy0r2TQb4HnnHOpYeNud859p9+ymWU459ab2WHAROAE59yKAxWnS8eJSE0455izejsvzlzD/77aSEmZIzk+mv6ZKfTPTGWA/79DSryunCEiIt9Rq0vHOefGVjPjzWbWwTm30cw6AFuqmMd6//9KM5sEDAIOGJZFRGrCzBjatTVDu7YmZ1dfJn69mQXZeSzM3sETU1ZS6rcypyfH0T/DC859OiTTtmU86clxtEuOIzpKv9EkIiLfVdtLx70DXAbc4/9/u+II/hUyCpxzRWaWBowA/lrL5YqIVCo9OY4LhnTmgiHe/cKSMpZs3MnC7DwWZO9gYXYeE5duIfxLtY4p8dwwtgfnHJmp0CwiIvup1S/4mVkb4DWgM7AGON85t83MBgPXOeeuNrNjgMeAcrw+0g86556qyfzVDUNE6sOuwhJW5OSzdVcRm3cV8trcbBas28FhaUnceXY/jumWFnSJIiLSgKrrhqGfuxaRZs85x4TFm7nnf1+zOjefG8f25KdjuhMVUv9mEZHmQD93LSJSDTPjpMPbM6
J7Gr9980vun7CMD77cSFqLOADap8TTo20LBnZKZUiX1oQUokVEmg2FZRERX1JcNA9cMJBjuqXx+rx17Ckpo6zcMXlZDv+Zlw1AlzaJXDS0M1eM6EpstPo3i4g0deqGISJSAzsKipm0NIeXZq9l9qptjO6VzqOXHEV8TFTQpYmISC1V1w1DzSIiIjWQmhjLuEEZvHbtcO7+/hFMXpbD5c/MZmdhSdCliYhIPVJYFhE5SBcN7cwD5w9kzurtjLh7Ine+t5iNeXuCLktEROqBwrKIyCEYNyiDt34ygjG92/Ls9NWc9MAU3l2wIeiyRESkjqnPsohILa3JzefGV79g/todHNcznaMPa8Ox3dM4IjMl6NJERKQGdJ1lEZF6VlJWzj8+Xc5/P1/P2m0FAFw0tBO3ntqHlISYgKsTEZHqKCyLiDSg7fnFPDp5BU9MXQlASkIMSXHRmEHb5HhO6NOWM/t3pFPrxIArFRERUFgWEQnEV+vzGL9oE9sKiikoKsMBK3J2szA7j6iQ8f1BGfx0THe6pCUFXaqISLOmX/ATEQlAv4wU+mV8t9/y+h17eHLqSl6atZY3Ps9m3MAM/jSuH0lxOiSLiEQaXQ1DRKSBZaQm8MczD2fqLWO4ckRX/jt/PS/PXht0WSIiUgmFZRGRgLRNjud3Z/SlW3oSk5flBF2OiIhUQmFZRCRgY3q1ZdbKbRQUlwZdioiIVKCwLCISsNG92lJcVs705blBlyIiIhUoLIuIBGxI11YkxkYxadmWoEsREZEKFJZFRAIWFx3FMd3SmLhkC1+s20FxaXnQJYmIiE9hWUQkApzevz0b8goZ94/PuPLZOUGXIyIiPoVlEZEIcPagTKb+egzXjDqMacu38mV2XtAliYgICssiIhGjU+tErj++OwkxUbwwc3XQ5YiICArLIiIRpWV8DOMGZfD2FxvYUVAcdDkiIs2ewrKISIT54dFZFJWWc9SdH3PyA1NYvmV30CWJiDRbCssiIhGmb8eWPHP5EK477jBy84u58PEZLNm4M+iyRESaJYVlEZEINKZ3W24+uTevXHM0ITPOfHgat7+7iN1F+pU/EZGGpLAsIhLBurdtwfs/H8l5gzN5dvpqTntoKpOX5VBe7oIuTUSkWTDnIveAO3jwYDd37tygyxARiQhzVm/jple/IHv7HtKT42idGMvhGS259dTetE2OD7o8EZFGy8zmOecGVzpMYVlEpPHYU1zG+MWbmPj1FvKLypiyLIe4mBDnHpXJ1SMPIyM1IegSRUQaHYVlEZEmakXObu6fsIzxizaR3iKO1398jAKziMhBUlgWEWniFm/YyQWPzyAuOopBnVO5eGhnxvRuG3RZIiKNQnVhWSf4iYg0AX07tuSFq4ZxVFYqi9bnccWzc7js6dm6RrOISC2pZVlEpIkpLi3n+RmreeiTbygqLeeZy4cwonta0GWJiEQstSyLiDQjsdEhrh55GBN/OZqubZK4+rm5vLtgA5HcOCIiEqkUlkVEmqj05DheuHooPdq14Gcvz+eKZ+ewNrcg6LJERBoVdcMQEWniysodz01fzb3jl7KnpIzBWa04/YgOXDSsM3HRUUGXJyISOHXDEBFpxqJCxpXHduWTXx7HTWN7squwlNveXczt7y4OujQRkYgXHXQBIiLSMDqkJPDzE3rw8xN6cPf/lvDY5JUUlpRx9qAMRvZID7o8EZGIpLAsItIM/fLEXmzOK+STJVv47+frOb1/B8b2acu4gRmYWdDliYhEDIVlEZFmKDY6xIMXDqKotIz7xi/jjXnZvL9wIwvW5fGb0/oQG61eeiIioBP8REQEcM5x1/tLeHLaKrqmJXF877ZcOjyLrDZJQZcmIlLv9HPXIiJSIxO/3syjk1ayIHsHzkG7lDguPboLV4/squ4ZItJkVReW1Q1DRET2Ob53O47v3Y7NOwt5cupKvlyfx10fLMEMrh55WNDliYg0OHVKExGR72jXMp7fnt6Xl64+mpE90nh08g
oKS8qCLktEpMGpZVlERKoUChk/Gd2di56Yydn/nE5yfDQGnHx4ey4/pguhkLpmiEjTprAsIiLVOvqw1lw6PItlm3cBsHNPKXe8t5iHJ35D2+R4zhucyZUjuio4i0iTpLAsIiLVMjPuOKvfvvvOOT78ahOTluawImc3d76/hGWbd3HNqG50b9siwEpFROqeroYhIiKHzDnH/ROW8fDE5QD8eHQ3bjihB/ExUQFXJiJSc7p0nIiI1BvnHJ+v3cFLs9byxufZxEWHOKVfe+49bwAxUTqPXEQiny4dJyIi9cbMOCqrFYM6pXLmgA5MWLyZf89aS3QoxNg+bQmFjP6ZKXRISQi6VBGRg6awLCIidSIUMkb3asvoXm2Jj4niqWmreOPz7H3D+3RoyZkDOnDliK7qpiEijYa6YYiISL3I3V3Ell1FFJWWM2tlLh8v2cyc1dtJjosmo1UCh3dMoU+HZC4d3oXYaHXXEJHgqM+yiIhEhFkrc3l7wQbWb9/Dl+vz2JZfTPe2LTi+d1tG90pnWNc2ROkSdCLSwBSWRUQkIk1YvJm7/7eE7G17KC4rp21yHMf3bkv3ti24cGhnWsSpt6CI1D+d4CciIhHpxL7tOLFvOwqKS/lkyRbeXbCBD77cyM7CUpZu2sXfzhsQdIki0swpLIuISOASY6M5c0BHzhzQEYBfvb6A9xZu5Pdn9qVlfEzA1YlIc6YzKkREJOJcfkwXCkvLuOjxmVz/0ue8Pncde4rLgi5LRJohhWUREYk4/TJS+Os5/Skrd8xbs52b/7OQEx+YzF8+/JrlW3YHXZ6INCM6wU9ERCKac45Jy3L48/tLWLk1n7Jyx/G92/Lj0d04snMrXT1DRGpNJ/iJiEijZWaM6dWWMb3asnV3ES/PWsvDny5n4tdbiI8JcfkxXfn1yb0IKTSLSD1QWBYRkUYjrUUcPzuhB+MGZTB71TY++HIjj05eweKNO7n7+0eQkaqf1BaRuqVuGCIi0mg553jms9XcN34pyfEx/OyE7owbmEGSrs8sIgehum4YOsFPREQaLTPjymO78uq1wylzjt+++RVXP6dGFhGpOwrLIiLS6PXLSGHaLWO44YQezFiZyxfrdgRdkog0EQrLIiLSJMRFR/GjUYeRHB/NIxO/oahU12UWkdpTWBYRkSajRVw0Pxp5GB8v2cKJ909h1db8oEsSkUZOYVlERJqUnx3fneeuHMrOwhLG3DuJUx6cwseLN6ulWUQOicKyiIg0KWbGcT3Tee9nx3Lj2B7sKSnj6ufnMuqvn+ons0XkoOnScSIi0qQVl5Zz3/ilPDZlJZmtEjipb3suHNqJnu2Sgy5NRCKEfsFPRESardjoELee2pvoKGPxhp08M30VT3+2ipE90jj3qEy+N6AjZvr1PxGpnFqWRUSkWVmRs5t3vtjA8zNWs72ghNZJsXxvQEd+cVJPWsbHBF2eiASgupZlhWUREWmWyssd/52/nrfmr2fa8q2cc2Qm950/IOiyRCQA6oYhIiJSQShknHtUJucelcmfP1jC41NW8vWmnQzt2ppfn9ybhNiooEsUkQigq2GIiEizd/3x3fnx6G4UlpTxzGer6fOHD3l62qqgyxKRCKBuGCIiImEmLN7MQ58sY8WWfE46vB0juqeRkZrAMd3a6ERAkSZK3TBERERq6MS+7ejetgW3vrGQT5Zs4e0vNgDQNjmOZ64YwuEdUwKuUEQaklqWRUREqrCnuIzVufl8vnY793zwNT3bJ/Pns4+gV3tdo1mkKamuZVl9lkVERKqQEBtFnw4t+cGwLG4Y24N5a7Zz8oNT+MPbXxHJjU0iUnfUDUNERKQGrh55GMO6tuEnL83j+RlriI0KceoR7Tkqq3XQpYlIPVI3DBERkYOwu6iUE+6bxOadRZjBqf3ac/oRHTm1X3tCIZ0AKNIY1dsJfmZ2HnAb0AcY6pyrNNma2SnAQ0AU8KRz7p7aLFdERCQoLeKimXHrCewqKuWOdxfz5vxsPv
hyE4d3bMmp/drTPzOVY7unKTiLNBG1alk2sz5AOfAY8KvKwrKZRQHLgBOBbGAOcJFzbvGB5q+WZRERiXS7Ckv496y1PPvZajbtLATgrrP78YNhWQFXJiI1VW8n+Dnnljjnlh5gtKHAcufcSudcMfAKcFZtlisiIhIpkuNjuO64bsz4v+MZf9MoBnVO5ZGJyykpKw+6NBGpAw1xNYwMYF3Y/Wz/MRERkSbDzOjZLpnrjuvGxrxCbnzlC5Zv2R10WSJSSwcMy2b2sZl9VclfvbQOm9k1ZjbXzObm5OTUxyJERETqzXE90xnQKZX3v9zI2Psnc/u7i4IuSURq4YAn+DnnxtZyGeuBTmH3M/3Hqlre48Dj4PVZruWyRUREGlR8TBRv/eQYFmbn8fDE5Tzz2Wp2F5byhzP7khwfE3R5InKQGqIbxhygh5l1NbNY4ELgnQZYroiISCDMjAGdUnn4okEM7dqa1+dlc8Rt4znpgcm8NGutftBEpBGpVVg2s7PNLBsYDrxvZh/5j3c0sw8AnHOlwPXAR8AS4DXnnL6TEhGRJi8hNorXrh3Os1cM4eJhndm8s4jfvPklL85aG3RpIlJD+lESERGRBlJe7rj4yZnMXLmNhy4cyFkDdb67SCSot0vHiYiISM2FQsad4/qR1SaRG175gsufmU3OrqKgyxKRaigsi4iINKDubZN592fHMm5gRyYtzWHIXR9z8+sLyC8qDbo0EamEwrKIiEgDaxkfw4MXDuK1a4czskcar8/L5uInZpK7W63MIpFGYVlERCQgQ7u25oWrhnHnuH4syM7je498xjsLNgRdloiEUVgWEREJ2CVHZ/HoJUexfscefv7yfOat2RZ0SSLiU1gWERGJAKf0a8+s35wAwO/fWkRpWXnAFYkIKCyLiIhEjHYt47n8mC4s3riT7/9rOpc/M5uXZ6+lvDxyL/Mq0tQd8OeuRUREpOHc9r3DSUmIYeo3OUxfkcukpTnc/cESzh/cidP6d+DIzq2CLlGkWdGPkoiIiESoPcVlPDZlBa/NWceGvELM4O2fjqB/ZmrQpYk0KfpREhERkUYoITaKG8f2ZNotxzPxl8eR1iKO7z3yGbNW5gZdmkizobAsIiIS4UIh47D0Fjz2w6OIDhk/fHo2c1frihkiDUFhWUREpJE4snMr3r5+BGXljnMfncGrc9YGXZJIk6ewLCIi0ogc3jGFKb8eQ4eUeG5540sueGwGb87PDroskSZLYVlERKSRyUhN4M2fjGBkjzS+2bKbm15dwPQVW4MuS6RJUlgWERFphNqnxPPCVcP4+4WDALj4iVlM/SYn4KpEmh6FZRERkUbs2B5pvHLN0QD88rUF/ObNL9myszDgqkSaDoVlERGRRu7ow9rw13P6U1hSxkuz1jL0z5/w2tx1QZcl0iQoLIuIiDQB5w/pxNzfncgfzugLwK//s5DHp6xgV2FJwJWJNG4KyyIiIk1EbHSIK4/tyse/GAXAnz/4mqPu/JiZ+hETkUOmsCwiItLEdG+bzMz/O4EfjexKcWk5Fz4+k/GLNgVdlkijpLAsIiLSBLVPiee3p/fliUsHA3DNC/NYk5sfcFUijY/CsoiISBN2Yt92vHP9CACO+9sklm/ZFXBFIo2LwrKIiEgT1z8zldvO9E78G3v/FP7x6fKAKxJpPKKDLkBERETq3+UjutKuZTyPfLqcv320lM07C7njrH5BlyUS8dSyLCIi0kycekQHHrpwIC3ionl+xhoufHwG63fsobzcBV2aSMRSWBYREWlGurdN5rNbj6d9y3hmrtzGiHsmcupDUyksKQu6NJGIpLAsIiLSzKQkxDDj/47n4YsG0adDS5Zu3kXv33/I/eOXBl2aSMRRWBYREWmGzIwzB3Tkg58fyx/9k//+PnE5z01fHWxhIhFGYVlERKQZMzOuGNGVD28cCcAf31nEDa/Mp6hU3TJEQGFZREREgN7tW/LBz0cSGx3i7S820Ot3HzLm3kl8+NVGnN
MJgNJ8WSS/AAYPHuzmzp0bdBkiIiLNhnOOF2au4cWZa1i2eTcAGakJ3HveAIZ3axNwdSL1w8zmOecGVzpMYVlEREQqcs6xOreAP7z9FVO/2QrAFSO68MczDw+4MpG6V11YVjcMERER+Q4zo2taEi9cNYx/Xz0MgGc+W821L6gRS5oXhWURERGp1ojuacz//Ykkxkbx0aLNXP3cHAqKS4MuS6RBKCyLiIjIAbVKiuWjG0fRKjGGj5dsYehdn7Bhx56gyxKpdwrLIiIiUiOdWify+e9PZGCnVHYXlXLMPRO5+rm5bMorDLo0kXqjsCwiIiI1Zmb898fHcNPYnmS2SuDjJZs5+u5PuP3dReTtKQm6PJE6p7AsIiIiByUUMm4Y24Opvx7Dn8b1A7yT/857dHrAlYnUPYVlEREROSRmxg+PzmLR7SfTo20Llm3ezf3jlwZdlkidUlgWERGRWkmKi+aFq7zLy/194nJenbM24IpE6o7CsoiIiNRa+5R4Jv1qNAC3vPEleQXqvyxNg8KyiIiI1IkuaUn8dEw3AAb+aTy3vbOILbt0pQxp3BSWRUREpM788sRenNG/Ay3ionl2+mqG3vUJHy3aRFm5C7o0kUNizkXuzjt48GA3d65+VlNERKQx+uuHX/PPSSsAOCw9iWcvH0rnNokBVyXyXWY2zzk3uLJhalkWERGRevHrU3rzn+uG0yYplpU5+Yz626fMW7Mt6LJEDorCsoiIiNSbwV1aM/u3Y7lpbE8AzvnXDH775peUlpUHXJlIzSgsi4iISL2K8n/E5MlLvW+5/z1rLec8OoM9xWUBVyZyYArLIiIi0iDG9m3H9FuPB2DBuh2MvX8yBcWlAVclUj2FZREREWkwHVMT+OauUzksLYn1O/bQ9w8f6ZrMEtEUlkVERKRBxUSF+OimUQzolArAgDvGM3e1TvyTyKSwLCIiIg0uJirE69cOZ3SvdADOfXQGz01fHWxRIpVQWBYREZFAxEaHeObyITx04UAA/vjOImasyA24KpH9KSyLiIhIYMyMswZm8OGNIwG46ImZfLU+L+CqRL6lsCwiIiKB692+JXecdTgA3//ndP14iUQMhWURERGJCJcO70K/jJYUl5Vzzr9m8Mxnq4IuSURhWURERCLH69cew8MXDQLg9ncX8+TUlQFXJM2dwrKIiIhEjITYKM4c0JFnrxgCwJ3vL1GXDAmUwrKIiIhEnNG92vLBz72T/s751wxOuG8SW3YWBlyVNEcKyyIiIhKR+nZsyZ3j+jEgM4UVOfkM/fMnrN6aH3RZ0swoLIuIiEjEuuToLN78yYh9P14y+t5J/P2TbwKuSpoThWURERGJaKGQ8fgPB+/78ZL7JyxjlVqYpYEoLIuIiEjEi40OcdbADO47bwAAlzw5i8/Xbg+4KmkOFJZFRESk0TjnqEzat4xn/Y49fP+f0/n06y1BlyRNnMKyiIiINCqTbh7Nr07qCcAVz87hhZlrAq5ImjKFZREREWlU4mOiuP74Hvt+vOT3b33FzJW5AVclTZXCsoiIiDRKZw7ouO9azJc9PZui0rKAK5KmSGFZREREGq2+HVty9GGtKSot5/qX5rNuW0HQJUkTo7AsIiIijdozlw9lYKdUJizezHmPzqC4tDzokqQJUVgWERGRRi0hNoq3fjqCoV1as2lnIT996XNW5OwOuixpIhSWRUREpEl4+oohDMhM4eMlmznj79PI3q4uGVJ7CssiIiLSJLSIi+bt64/l/KM6saekjB88OYuychd0WdLIKSyLiIhIk/KXc/vTp0NL1uQWcN2L8ygoLg26JGnEFJZFRESkyXnxqqHERoWYsHgzv39rEau25gddkjRSCssiIiLS5LRpEceiO06mY0o8b3yezc2vL+Cz5VuDLksaIYVlERERaZJiokJMveV4zh6Uwdw127n06dl8+NVGlm7aFXRp0ogoLIuIiEiTFRUy7j1vAI9cPIiycsd1L37OyQ9OYf2OPUGXJo2Ewr
KIiIg0aVEh4/QjOjDhplHcempvAEbcM5G8PSUBVyaNgcKyiIiINHlmRo92yVx1bFcuHNIJgAG3j2fLzsKAK5NIV6uwbGbnmdkiMys3s8HVjLfazL40sy/MbG5tlikiIiJyqGKiQtx+1uGcd1QmAJc8NSvgiiTS1bZl+Svg+8CUGow7xjk30DlXZagWERERqW9x0VHc/f0jaNcyjmWbd3PR4zNZvkUn/UnlahWWnXNLnHNL66oYERERkYYQHRXiuSuHMqpnOjNW5nLeozPYsktdMuS7GqrPsgPGm9k8M7umgZYpIiIiUqXe7Vvy/JVDGdqlNdsLSvjR8/PYurso6LIkwhwwLJvZx2b2VSV/Zx3Eco51zh0JnAr81MxGVbO8a8xsrpnNzcnJOYhFiIiIiBy8V689mozUBBas28GvXl9AYUlZ0CVJBDlgWHbOjXXO9avk7+2aLsQ5t97/vwV4ExhazbiPO+cGO+cGp6en13QRIiIiIofEzPj4F8eR2SqBSUtz+M2bXwZdkkSQeu+GYWZJZpa89zZwEt6JgSIiIiIRISE2ipeuPpoWcdG8NX89x/5F12EWT20vHXe2mWUDw4H3zewj//GOZvaBP1o7YJqZLQBmA+875z6szXJFRERE6lrnNom8cNVQTjuiA9nb93Duv6aTV6DA3NyZcy7oGqo0ePBgN3euLsssIiIiDSdvTwmnPDiFjXmFnNi3HQ9fNIj4mKigy5J6ZGbzqrq8sX7BT0RERCRMSkIM42/yrkUwYfFmXp2zjpKy8oCrkqAoLIuIiIhUkBwfwxd/OBGAP76ziBtf/SLgiiQoCssiIiIilUhNjOXlHx3NwE6pvL9wIy/MWB10SRIAhWURERGRKgzv1oZfnNgTgN+/vYjZq7YFXJE0NIVlERERkWqM6pnO9WO6A3DzfxZQUFwacEXSkBSWRURERA7gVyf34vJjurAmt4Chd33CrkJdUq65UFgWERERqYHrj+/OpcOz2F1Uym/e1O+rNRcKyyIiIiI1kNYibl93jHcXbGDJxp0BVyQNQWFZREREpIbatoznvvMGAHDa36dSVFoWcEVS3xSWRURERA7CuEEZ/PDoLJyDs/8xnbW5BUGXJPVIYVlERETkIESFjOuP786p/dqzeONOnpi6knXbFJibKoVlERERkYPUrmU8958/kKTYKF6YuYY/vrMo6JKknigsi4iIiByChNgoPrv1eEb2SGPa8q2c/9gM9WFughSWRURERA5RamIs1x3XjYGZqcxetY1X56xjwbodQZcldUhhWURERKQWRnRP4zen9wHgD28v4txHp1NYohbmpkJhWURERKSWBnZKZcrNY7hpbE9KyhyXPzObNbn5QZcldUBhWURERKQOdG6TyPePzGBE9zbMXLmNdxdsIG+Pfha7sVNYFhEREakjnVon8tRlQ4iJMu4dv4yzHpkWdElSSwrLIiIiInUoPiaKV68dzmlHtGfttgKenLoS51zQZckhUlgWERERqWNHdm7F9wdlEhMV4s73l7Bqq/ovN1YKyyIiIiL1YGzfdjx+6WAALn5iFr96fUHAFcmhUFgWERERqSdHZbXi4mGdSY6P5v2FG1mRs5uycnXJaEwUlkVERETqSYu4aP589hGcNziTPSVlnHDfZB6YsCzosuQgKCyLiIiI1LOLh2Xxzx8cSXpyHFOXb+Wt+espVwtzo6CwLCIiIlLPWsRFc9r/t3fnQVLWdx7HP9++ZoY5GC5xmGE4Ah6IHAqKJloasUQ04qqJ6GpEk3V3E6PZyw1eG3WTrV23jOXqxnWj8VjLYxOjGPEYs3grl4CC3BAORWAYmYu5eua3f0xjEKel7ZnuXx/vVxXl9PSj/alvPTV8/M3veZ5jKzShsr9WbNurHz+5XCu281jsbEBZBgAASJP/uvx4PXLVCZKkO2vWafk2CnOmoywDAACkSSgY0OTqco2vLNPbG/foiUVbfUfCIVCWAQAA0qi0MKzf/+gUjRpcrBdWfq
JZ97yptzbU+o6FOCjLAAAAHsw5eaSOHzFAqz5u0Ktrd/mOgzgoywAAAB5cNm2EHpwzVYNLCvTwO1t01UOLio8P2gAAEKRJREFUfUdCDyjLAAAAHs2deZSOHFqqN9fXyjluJ5dpKMsAAAAezZpUqXMmVKi9s0uTb6/R3n3tviPhAJRlAAAAzy44rlIXHFepvfs6tKm2mQeWZBDKMgAAgGeHlRZq9tRqSdIF//m2rvj1Is+JsB9lGQAAIANMri7XbbOO0aTh5Vq9o1FdXY49zBmAsgwAAJABwsGAvnvSSJ04eqBqm9o0+ob5uui+d3zHynsh3wEAAADwJ5dPG6GSSEhvbqjV0i2fyjknM/MdK2+xsgwAAJBBqgb004/OGKszxw1VtMtp0m01uv/1jb5j5S3KMgAAQAY6Z0KFvveNUQoHA3pn4x7fcfIWZRkAACADVfQv0s3njtPRFaVatLlOc369SK0dnb5j5R3KMgAAQAabPbVao4eU6NW1u7W5ttl3nLxDWQYAAMhg50yo0NyZR0mS7qxZp5ufWan1Oxs9p8of3A0DAAAgwx0xtFSjBhdr2dZPVdvUrqJIUDfMPNp3rLzAyjIAAECGG1xSoAV/f5qW3HSmhpYVaNXH9Vq9o8F3rLxAWQYAAMgiVQP66a0Ne3TZrxb6jpIXKMsAAABZ5KErp+qyadXa09yuzi4eh51qlGUAAIAsUloY1teGlEiSxtw4X9c+vsxzotzGBX4AAABZ5lsTh6mhJaoXVu7Qiu17fcfJaawsAwAAZJnBJQW6bvpYTRk5QDsbWnXLsyv17iae8pcKlGUAAIAsNXXkQBVHQnps4Vbd99pG33FyEmUZAAAgS82aVKmlN5+paaMH6tPmdu1ubFNHZ5fvWDmFsgwAAJDlyvtFtGJ7vab+7BX9ObeU61Nc4AcAAJDlrj/rSE0bPUjzln+kP9Y2+46TU1hZBgAAyHIjBhXr8mkjNLGqXHv3deiOl9bo9XW7fcfKCZRlAACAHDG+sr8k6d4FG/Xz+as9p8kNlGUAAIAccf7kSq372dm68LgqNbZGfcfJCZRlAACAHFNSENSO+haddscCff/hJb7jZDUu8AMAAMgxFxxXpYbWqNZ80qhXVu9Ue7RLkRBrpMlgagAAADlm4vBy/eLiSfr28VWSpK11+9TUxraMZFCWAQAAclT/orAkafqdr+m422u0u7HNc6LswzYMAACAHHX2sYer0zmt2LZXjy3cqp0NrRpSWuA7VlZhZRkAACBH9YuE9J0pwzVj/OGSpGXb9mrNJw2eU2UXVpYBAABy3KDi7tXkm59ZKUmq+ZtTNXZoqc9IWYOVZQAAgBx3dEWpfveDk3XjzKMlSbub2LucKFaWAQAAcpyZaXL1AJmZJGnZ1r2Sup/4V1YY9hkt41GWAQAA8sSg4ogk6Y6X1kqSvjOlSv920USfkTIeZRkAACBPDB/YTy9cd4rqWzo09+kPVNfc7jtSxqMsAwAA5JGjK8okSQOLI9rT3K4Nu5pUNaBIheGg52SZiQv8AAAA8lD/orCWbd2r6Xe+pmsfX+Y7TsaiLAMAAOSh22Ydo7svmaxxFWXayZP94qIsAwAA5KGqAf103sRhGj6wSK3tnb7jZCzKMgAAQB4rCge1flejJt32ss65+w11dHb5jpRRuMAPAAAgj1359VHqXxTWmk8atXBznepbOjS4pMB3rIxBWQYAAMhjE4eXa+Lwcj25eKsWbq5TawdbMg5EWQYAAMBnt46765X1Ki8K68Ljqz67zVw+oywDAABAYw8r1eCSiF5c+Yma2qJqbo/qXy6Y4DuWd5RlAAAAaNywMi256UxJ0ml3LFBzG9sxJO6GAQAAgIMURUJqbouqPdol55zvOF5RlgEAAPA5JQVB/WHNLh1x0wu6+tGlvuN41auybGZ3mNkaM3vfzH5nZuVxjpthZmvNbIOZ/aQ3nwkAAIDUumHm0fqHs47UuIoyrdvZ6DuOV71dWa6RNN45N0HSOk
lzDz7AzIKS7pV0tqRxki4xs3G9/FwAAACkyOTqAfrh6WM0oaq/WvL86X69KsvOuZedc9HYy3clVfVw2AmSNjjnNjnn2iU9IWlWbz4XAAAAqVcYDqqxNaoXV+7Qiyt3aEd9i+9IadeXd8O4StKTPXy/UtK2A15vl3RivP+ImV0t6WpJqq6u7sN4AAAA+CqGlhWqpaNTf/U/70mSzjjqMD0wZ6rnVOl1yLJsZq9IOryHt250zj0bO+ZGSVFJj/U2kHPufkn3S9KUKVPy+/JLAAAAj64+dbS+edRh6nJOc5/+QPUtHb4jpd0hy7JzbvqXvW9mcySdK+kM1/O9RT6SNPyA11Wx7wEAACCDBQOmIw8vlSQNLI5oV2Or50Tp19u7YcyQdL2k85xz++IctljSWDMbZWYRSbMlzevN5wIAACC9CsMB7Wxo00Nvbdazyz/Km/sv9/ZuGPdIKpVUY2bLzew+STKzYWY2X5JiFwBeI+klSaslPeWcW9XLzwUAAEAajRhUrN2Nbfrpcx/quieWa8ueeOukuaVXF/g558bE+f7HkmYe8Hq+pPm9+SwAAAD4c/1ZR+rqU0Zrwdpd+tunVqipLXrofykH8AQ/AAAAHJKZaUBxRAOLI5KktmiX50Tp0Ze3jgMAAECOKwgFJUlvrN+tXQ2tKowEdcqYwQoFc3MNlrIMAACAhB1WViBJuuuV9Z9975GrTtCpRwzxFSmlKMsAAABI2NeGlOjNfzxdTW1RbdmzT3/56FI1tObu/ZcpywAAAPhKqgb0kyQVxrZktOfw/uXc3FwCAACAlCsId1fJXL7Yj7IMAACApBSFu1eW5z79gUbPfV4z7nrdc6K+xzYMAAAAJKW8X0Q//7NjtaO+RYs212nh5jp1dTkFAuY7Wp+hLAMAACBpl55YLUn65asbtXBzndo7u1QYCHpO1XfYhgEAAIBeCwe7V5Nzbf8yK8sAAADotYJQ9xrs/y7ZptLCkI6tLNe4YWWeU/UeZRkAAAC9VtG/SJL0z8+vliQdM6xMz197is9IfYKyDAAAgF6bPm6oFt84XR2dXbrl2VXatLvJd6Q+QVkGAABAnxhS2v0o7LKikNo7c2PvMhf4AQAAoE9FggF15EhZZmUZAAAAfSoSCqihJaq7/7BekjRj/OE6Ymip51TJoSwDAACgT409rEQtHZ26s2adJGlzbbN+cfEkz6mSQ1kGAABAn7r8pJG69MQRkqSz7npdbdFOz4mSR1kGAABAnwvGHnkdDgbU0ek8p0keF/gBAAAgZcJBy+qL/VhZBgAAQMqEgwHVNrXp7Y21Cphp0vByFYaDvmMljJVlAAAApEx5UVgrP2rQpf+9ULPvf1cPvLnZd6SvhJVlAAAApMy/f3ui1u5slCRd8eAi1bd0eE701VCWAQAAkDIDiiOaNnqQJKkgFFB7NLv2L7MNAwAAAGkRCWXfk/0oywAAAEiLcBY+BpuyDAAAgLSIhAJ6d1Od5j79vnY2tPqOkxDKMgAAANLitCOGqKOzS48v2qY31tf6jpMQyjIAAADS4tZZ4/X0D06WpKzZjkFZBgAAQNqEAt31M0pZBgAAAD4vEuyun+2dznOSxFCWAQAAkDbhkEnKnpVlHkoCAACAtNm/DePlD3eqrrldV5w8UsPKizynio+yDAAAgLQJB00Tq/przY4GLd3yqQaXFOgvTh3tO1ZcbMMAAABA2piZnr3mG3rvljMlSe0Zvh2DsgwAAIC0C8e2Y2T6LeQoywAAAEi7QMAUMCma4XfFoCwDAADAi1AwoI4uVpYBAACALwgHjJVlAAAAoCfhUEAPv/1HHX97jVZ9XO87To8oywAAAPDip986RudPrtSe5nZt2t3sO06PKMsAAADw4vzJlfrh6WMkSdEM3btMWQYAAIA3ocD+x19n5t5lyjIAAAC8CQVjZbmLsgwAAAB8TjBAWQYAAAB6tP9JftEMfZIfZRkAAADe7N+GcetzH+rZ5R95TvNFId8BAAAAkL+KIy
GdN3GYttTtU1tH5q0uU5YBAADgTSBguvuSyb5jxMU2DAAAACAOyjIAAAAQB2UZAAAAiIOyDAAAAMRBWQYAAADioCwDAAAAcVCWAQAAgDgoywAAAEAclGUAAAAgDsoyAAAAEAdlGQAAAIiDsgwAAADEQVkGAAAA4qAsAwAAAHFQlgEAAIA4KMsAAABAHJRlAAAAIA7KMgAAABCHOed8Z4jLzHZL2uLhowdLqvXwufmA2aYW800dZptazDd1mG3qMNvUSud8RzjnhvT0RkaXZV/MbIlzborvHLmI2aYW800dZptazDd1mG3qMNvUypT5sg0DAAAAiIOyDAAAAMRBWe7Z/b4D5DBmm1rMN3WYbWox39RhtqnDbFMrI+bLnmUAAAAgDlaWAQAAgDjyuiyb2QwzW2tmG8zsJz28X2BmT8beX2hmI9OfMjslMNs5ZrbbzJbH/nzfR85sZGYPmtkuM1sZ530zs7tjs3/fzI5Ld8ZslcBsTzOz+gPO21vSnTFbmdlwM1tgZh+a2Sozu66HYzh3k5TgfDl/k2BmhWa2yMxWxGZ7aw/H0BeSlOB8vXaGUDo/LJOYWVDSvZLOlLRd0mIzm+ec+/CAw74n6VPn3Bgzmy3pXyVdnP602SXB2UrSk865a9IeMPs9JOkeSY/Eef9sSWNjf06U9MvYP3FoD+nLZytJbzjnzk1PnJwSlfR3zrn3zKxU0lIzqzno5wLnbvISma/E+ZuMNknfdM41mVlY0ptm9oJz7t0DjqEvJC+R+UoeO0M+ryyfIGmDc26Tc65d0hOSZh10zCxJD8e+/o2kM8zM0pgxWyUyWyTJOfe6pLovOWSWpEdct3cllZtZRXrSZbcEZoskOed2OOfei33dKGm1pMqDDuPcTVKC80USYudjU+xlOPbn4Au+6AtJSnC+XuVzWa6UtO2A19v1xR8snx3jnItKqpc0KC3pslsis5WkC2O/av2NmQ1PT7S8kOj8kZyTYr8ufMHMjvEdJhvFfkU9WdLCg97i3O0DXzJfifM3KWYWNLPlknZJqnHOxT136QtfXQLzlTx2hnwuy/DrOUkjnXMTJNXoT/9HDmSy99T9SNSJkv5D0jOe82QdMyuR9FtJP3bONfjOk2sOMV/O3yQ55zqdc5MkVUk6wczG+86USxKYr9fOkM9l+SNJB/6fSVXsez0eY2YhSf0l7UlLuux2yNk65/Y459piL38l6fg0ZcsHiZzbSIJzrmH/rwudc/Mlhc1ssOdYWSO2H/G3kh5zzj3dwyGcu71wqPly/vaec26vpAWSZhz0Fn2hD8Sbr+/OkM9lebGksWY2yswikmZLmnfQMfMkXRH7+iJJ/+e4MXUiDjnbg/Yhnqfu/XXoG/MkfTd2Z4Fpkuqdczt8h8oFZnb4/n2IZnaCun+G8hdiAmJze0DSaufcnXEO49xNUiLz5fxNjpkNMbPy2NdF6r54fc1Bh9EXkpTIfH13hry9G4ZzLmpm10h6SVJQ0oPOuVVmdpukJc65eer+wfOomW1Q90U/s/0lzh4JzvZaMztP3Vdw10ma4y1wljGzxyWdJmmwmW2X9E/qviBCzrn7JM2XNFPSBkn7JF3pJ2n2SWC2F0n6azOLSmqRNJu/EBP2dUmXS/ogtjdRkm6QVC1x7vaBRObL+ZucCkkPx+70FJD0lHPu9/SFPpPIfL12Bp7gBwAAAMSRz9swAAAAgC9FWQYAAADioCwDAAAAcVCWAQAAgDgoywAAAEAclGUAAAAgDsoyAAAAEAdlGQAAAIjj/wFOUFG9yDYoDwAAAABJRU5ErkJggg==\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Split the data into a training set and a test set."
      ],
      "metadata": {
        "id": "Hzs8rRHos7a_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "train, test = texts.randomSplit([0.8, 0.2], seed=123)"
      ],
      "metadata": {
        "id": "2gD7e_kPhFNL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Remove stop words. We start by loading Spark's default English stop-word list."
      ],
      "metadata": {
        "id": "QMcOvXX7tCKb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "stopwords = set(StopWordsRemover.loadDefaultStopWords(\"english\"))"
      ],
      "metadata": {
        "id": "symvJNl8hInr"
      },
      "execution_count": null,
      "outputs": []
    },
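    {
      "cell_type": "markdown",
      "source": [
        "The cells above walked through how each method is called and inspected the output of every step. We can now apply the same steps to the whole dataset by chaining them into a single pipeline."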
      ],
      "metadata": {
        "id": "-rtuADjVtGR5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "sw_remover = StopWordsRemover() \\\n",
        "    .setInputCol(\"normalized\") \\\n",
        "    .setOutputCol(\"filtered\") \\\n",
        "    .setStopWords(list(stopwords))\n",
        "\n",
        "count_vectorizer = CountVectorizer(inputCol='filtered', \n",
        "    outputCol='tf', minDF=10)\n",
        "idf = IDF(inputCol='tf', outputCol='tfidf', minDocFreq=10)\n",
        "\n",
        "text_processing_pipeline = Pipeline(stages=[\n",
        "        assembler, \n",
        "        sentence, \n",
        "        tokenizer, \n",
        "        lemmatizer, \n",
        "        normalizer, \n",
        "        finisher, \n",
        "        sw_remover,\n",
        "        count_vectorizer,\n",
        "        idf\n",
        "    ])"
      ],
      "metadata": {
        "id": "Xy-0CUNihKRh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from pyspark.ml.feature import IndexToString, StringIndexer\n",
        "from pyspark.ml.classification import *\n",
        "from pyspark.ml.tuning import *\n",
        "from pyspark.ml.evaluation import *"
      ],
      "metadata": {
        "id": "2HSmY8CdhL5H"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here we classify with naive Bayes, one of the simplest classification methods."
      ],
      "metadata": {
        "id": "Dp5qZFA_tpaD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "label_indexer = StringIndexer(inputCol='newsgroup', outputCol='label').fit(texts)\n",
        "naive_bayes = NaiveBayes(featuresCol='tfidf')\n",
        "prediction_deindexer = IndexToString(inputCol='prediction', outputCol='pred_newsgroup', \n",
        "                                     labels=label_indexer.labels)\n",
        "\n",
        "pipeline = Pipeline(stages=[\n",
        "    text_processing_pipeline, label_indexer, naive_bayes, prediction_deindexer\n",
        "])"
      ],
      "metadata": {
        "id": "IomCEfIjhNu5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Fit the model on the training set."
      ],
      "metadata": {
        "id": "GSjWmKHTtwdP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model = pipeline.fit(train)"
      ],
      "metadata": {
        "id": "tct-2q08hPly"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Evaluate the model on the training set and the test set separately."
      ],
      "metadata": {
        "id": "h8SPAYbvtv5Q"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "train_predicted = model.transform(train)\n",
        "test_predicted = model.transform(test)"
      ],
      "metadata": {
        "id": "yn7NsQlshRFl"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "evaluator = MulticlassClassificationEvaluator(metricName='f1')"
      ],
      "metadata": {
        "id": "O1wcqAgPhTNZ"
      },
      "execution_count": null,
      "outputs": []
    },
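    {
      "cell_type": "markdown",
      "source": [
        "For reference, the per-class quantities behind the `f1` metric requested above can be written as follows. This is a sketch of the standard definitions, not output from this notebook; $TP_c$, $FP_c$ and $FN_c$ denote the true positives, false positives and false negatives for class $c$.\n",
        "\n",
        "$$P_c = \\frac{TP_c}{TP_c + FP_c}, \\qquad R_c = \\frac{TP_c}{TP_c + FN_c}, \\qquad F_{1,c} = \\frac{2 P_c R_c}{P_c + R_c}$$\n",
        "\n",
        "As far as we understand Spark's `MulticlassClassificationEvaluator`, its `f1` metric averages the per-class scores weighted by class frequency, where $n_c$ is the number of true examples of class $c$ and $N$ is the total number of examples:\n",
        "\n",
        "$$F_1 = \\sum_c \\frac{n_c}{N} F_{1,c}$$\n",
        "\n",
        "With illustrative values (not taken from this dataset) $P_c = 0.8$ and $R_c = 0.5$, the per-class score is $F_{1,c} = 0.8 / 1.3 \\approx 0.615$, which is lower than the arithmetic mean of 0.65 because the harmonic mean is pulled toward the smaller value."
      ],
      "metadata": {}
    },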
    {
      "cell_type": "code",
      "source": [
        "print('f1', evaluator.evaluate(train_predicted))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lnn3Mp4hhVKD",
        "outputId": "39cbdf4e-f18c-495b-fc89-40fc7a81f611"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "f1 0.9368578470322438\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print('f1', evaluator.evaluate(test_predicted))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mpXRBwxfhXqO",
        "outputId": "a76619cd-3236-4eca-9442-6c79da30dd5f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "f1 0.6096141991442257\n"
          ]
        }
      ]
    },
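    {
      "cell_type": "markdown",
      "source": [
        "We use the F1 score to judge the final model because it is the harmonic mean of precision and recall, so it balances the two. The model scores 0.937 on the training set but only 0.610 on the test set, a gap of roughly 0.33, which suggests that the model may be overfitting."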
      ],
      "metadata": {
        "id": "5Q0CSVq1t6ba"
      }
    }
  ]
}